Technical report accompanying the paper titled To the moon: finding fast-growing firms for investment opportunities¶
Author: Márton Nagy
Course: Data Analysis 3 - 2nd Home Assignment
The purpose of this document is to facilitate understanding of the technical details behind the accompanying summary report, while providing general insight into the key decision points I faced during the analysis. Note that this report generally focuses on the technical details of the analysis rather than the actual business case. The document includes many, but not all, code snippets and outputs from the source code (sometimes in a somewhat modified form to ease understanding). Therefore, for an even more complete picture (and to reproduce the results), please refer to the source code directly, available in my public GitHub repository. Note that some cells have not been executed in this technical report (mostly where they would have modified results that were already calculated and saved beforehand).
First, let's take a quick look at the necessary packages for this analysis:
- general packages for dealing with data and I/O:
import pandas as pd
import numpy as np
from time import time
from pathlib import Path
import dill
import json
import warnings
warnings.filterwarnings('ignore')
import requests
import io
import os
- packages to produce pretty visualizations and color palette:
from plotnine import *
from IPython.display import display
import patchworklib as pw
import matplotlib.pyplot as plt
import seaborn as sns
from mizani.formatters import percent_format
%matplotlib inline
color = ["#DB5461", "#080357", "#3B8EA5", "#3B8EA5", "#3B8EA5"]
- and lastly, packages needed for model building and evaluation:
import patsy
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import (
LogisticRegression,
LogisticRegressionCV,
)
from sklearn.metrics import (
auc,
brier_score_loss,
confusion_matrix,
mean_squared_error,
roc_auc_score,
roc_curve,
)
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from statsmodels.tools.eval_measures import rmse
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance
import shap
The following parts of this document will be divided into two general sections: data preparation and model building & analysis. This division follows the division between the two submitted Jupyter-notebooks.
1. Data preparation¶
The raw datafile was downloaded from OSF directly; here it is imported from my GitHub, as OSF has a rather slow response time. It is an already cleaned panel dataset of firm-level financial data. As instructed by the task description, I start by filtering the dataset to the years between 2010 and 2015.
data = pd.read_csv('https://github.com/marton-nagy-marton/Data-Analysis-3/raw/refs/heads/main/Home_Assignment_2/cs_bisnode_panel.zip')
data = data[(data['year'] >= 2010) & (data['year'] <= 2015)]
The next step is to drop features with a substantial share of missing values, as imputing this many missing values would have been infeasible. Also note that two other variables, exit_date and exit_year, had just under 90% missing values, so they were not dropped at this point; however, I did not end up using these during feature engineering, so they will be dropped eventually.
missing_pct = (data.isna().sum() / data.shape[0]) * 100
print(missing_pct[missing_pct > 90])
data = data.drop(columns = list(missing_pct[missing_pct > 90].index))
COGS              94.600432
finished_prod     94.778827
net_dom_sales     94.600432
net_exp_sales     94.600432
wages             94.690524
D                100.000000
dtype: float64
Note that the original panel was unbalanced in the sense that not all firms were observed in all years. To resolve this issue (as it would have caused technical difficulties later), I added new rows for previously missing company-year observations. This way I ended up with 39375 unique companies observed over 6 years.
data = (
data.set_index(['year', 'comp_id'])
.unstack(fill_value='toReplace')
.stack()
.reset_index()
)
data = data.replace('toReplace', np.nan)
print(f'T:N = {len(data.year.unique())}:{len(data.comp_id.unique())}')
T:N = 6:39375
1.1. Label engineering¶
The business rationale behind using the 2-year sales CAGR as an indicator for fast growth has already been described in the summary paper. Now, let's focus more on the technical definition. Remember that CAGR is defined as:
$$ CAGR = \left(\frac{V_{final}}{V_{begin}}\right)^{\frac{1}{t}} - 1 $$

To calculate the 2-year CAGR of every company, I first define an indicator denoting whether a firm is alive in a given year (I borrowed this indicator from the case study covered in class). Next, I look for firms that were alive in both $year_{t}$ and $year_{t+2}$ and calculate the 2-year CAGR on sales for these. I then compute the yearly average 2-year CAGR and merge it back onto my original dataset. Finally, I can define my binary indicator of fast growth: this indicator is set to 1 if:
- the firm was alive in both $year_{t}$ and $year_{t+2}$ (as otherwise there is no CAGR);
- and it had a 2-year CAGR higher than the yearly sample mean.
Note that initially, I thought about defining an even stricter threshold, e.g. classifying only those firms as fast-growth that exceed the mean CAGR by a certain number of standard deviation units. But as we will see, the distribution of 2-year CAGR values was very skewed, so simply comparing to the mean already proved strict enough.
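As a quick sanity check of the formula above, here is a tiny worked example with hypothetical numbers (not taken from the dataset):

```python
# Worked example of the 2-year CAGR formula:
# CAGR = (V_final / V_begin) ** (1 / t) - 1
sales_2012 = 100_000  # sales in the base year (hypothetical)
sales_2014 = 144_000  # sales two years later (hypothetical)
t = 2                 # horizon in years

cagr_2y = (sales_2014 / sales_2012) ** (1 / t) - 1
print(f'{cagr_2y:.0%}')  # 20%, since 100,000 * 1.2 * 1.2 = 144,000
```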
# parentheses needed around the comparison: & binds tighter than >
data['status_alive'] = ((data['sales'] > 0) & data['sales'].notna()).astype(int)
data = data.sort_values(by = ['comp_id', 'year'])
data['cagr_2y'] = np.where(
(data['status_alive'] == 1) & (data.groupby('comp_id')['status_alive'].shift(-2) == 1),
(data.groupby('comp_id')['sales'].shift(-2) / data['sales']) ** (1/2) - 1,
np.nan
)
cagr_stats = (
data.dropna(subset=['cagr_2y'])
.groupby('year')['cagr_2y']
.agg(['mean', 'std'])
.rename(columns={'mean': 'cagr_mean', 'std': 'cagr_std'})
)
data = data.merge(cagr_stats, on='year', how='left')
# NaN CAGR (firm not alive in both years) compares as False, i.e. not fast-growing
data['b_fast_growth'] = (data['cagr_2y'] > data['cagr_mean']).astype(int)
To reinforce the ideas above, let's look at the distribution of the calculated 2-year CAGR values in 2012. Note that as the CAGR values can be very extreme, I capped them at 25 for the chart so that we can better see the cut-off. What is quite apparent is that the bigger chunk of the distribution falls to the left of the mean value. Altogether, in 2012 only around 14% of the alive companies were classified as fast-growing (note that this is slightly higher than the number included in the summary report; the reason is that after further data cleaning, the share of fast-growing companies dropped to around 12.4%).
data['cagr_2y_capped'] = data['cagr_2y'].clip(upper=25)  # clip keeps missing values as NaN
display(
ggplot(data[(data['year'] == 2012) & (data['status_alive'] == 1)], aes(x='cagr_2y_capped'))
+ geom_histogram(bins=125, fill=color[0], color=None)
+ geom_vline(xintercept=data[data['year'] == 2012].cagr_mean.unique(), linetype='dashed', color='black', size = 0.5)
+ labs(title='Histogram of 2-year CAGR in 2012', x='2-year CAGR (capped at 25)', y='Frequency')
+ annotate('text', x=data[data['year'] == 2012].cagr_mean.unique()+0.5, y=3000, label='Mean 2-year CAGR in 2012', color='black', size=8, angle=90)
+ theme_bw()
)
data.drop(columns=['cagr_2y_capped'], inplace=True)
# double quotes around the f-string avoid nested-quote errors on Python < 3.12
print(f"In 2012, {data[(data['year'] == 2012) & (data['status_alive'] == 1)].b_fast_growth.mean():.2%} of alive companies classified as fast-growing.")
In 2012, 14.07% of alive companies classified as fast-growing.
We can now filter the dataset for our year of interest (that is 2012) and the previous years. We shall not use any information regarding the future (as that is what we are trying to predict).
data = data[data['year'] <= 2012]
1.2. Feature engineering¶
For some parts of the remaining data preparation, I simply followed what we did in the case study. For these parts, I will not provide a detailed explanation, as I expect the reader to be familiar with that code and logic.
First, I clean up the sales variable. I assign 1 to negative values and then take the logarithm (zero sales are assigned a log value of zero). I do the same for sales transformed to million EUR units. I also add log-differences for the changes from 2011 to 2012 and from 2010 to 2011.
data["sales"] = np.where(
data["sales"] < 0, 1, data["sales"]
)
data = data.assign(
ln_sales=np.where(
data["sales"] > 0,
np.log(data["sales"]),
(np.where(data["sales"].isna(), np.nan, 0)),
), # NaN remain NaN
sales_mil=data["sales"] / 1000000,
ln_sales_mil=np.where(
data["sales"] > 0,
np.log(data["sales"] / 1000000),
(np.where(data["sales"].isna(), np.nan, 0)),
),
)
data["d1_ln_sales_mil"] = data["ln_sales_mil"] - data.groupby("comp_id")["ln_sales_mil"].shift(1)
data["d2_ln_sales_mil"] = data.groupby("comp_id")["ln_sales_mil"].shift(1) - data.groupby("comp_id")["ln_sales_mil"].shift(2)
Next, I define the age of the firm as the difference between the year of the observation and the founding year, and create a binary indicator for the firm being a new one. For these new firms (and for those for which this indicator could not be created), I assign 0 as the log difference in sales.
data["age"] = np.where(
data["year"] - data["founded_year"] < 0, 0, data["year"] - data["founded_year"]
)
data["b_new"] = np.where(
((data["age"] <= 1) | (data["balsheet_notfullyear"] == 1)),
1,
(np.where(data["age"].isna(), np.nan, 0)),
)
data["d1_ln_sales_mil"] = np.where(data["b_new"] == 1, 0, np.where(data["b_new"].isna(), np.nan, data["d1_ln_sales_mil"]))
data["d2_ln_sales_mil"] = np.where(data["b_new"] == 1, 0, np.where(data["b_new"].isna(), np.nan, data["d2_ln_sales_mil"]))
data["b_new"] = np.where(data["d1_ln_sales_mil"].isna(), 1, data["b_new"])
data["d1_ln_sales_mil"] = np.where(data["d1_ln_sales_mil"].isna(), 0, data["d1_ln_sales_mil"])
data["d2_ln_sales_mil"] = np.where(data["d2_ln_sales_mil"].isna(), 0, data["d2_ln_sales_mil"])
I clean up the 2-digit industry codes a bit by joining certain categories.
data["ind2_cat"] = data["ind2"].copy()
data["ind2_cat"] = np.where(data["ind2"] > 56, 60, data["ind2_cat"])
data["ind2_cat"] = np.where(data["ind2"] < 26, 20, data["ind2_cat"])
data["ind2_cat"] = np.where((data["ind2"] < 55) & (data["ind2"] > 35), 40, data["ind2_cat"])
data["ind2_cat"] = np.where(data["ind2"] == 31, 30, data["ind2_cat"])
data["ind2_cat"] = np.where(data["ind2"].isna(), 99, data["ind2_cat"])
Now we can look at the distribution of some categorical variables. As some variables are present both as a continuous variable and as a categorical one, the key decision here is which version to work with further. Looking at the distributions, we can see that for both the gender and the origin of the CEOs, the categorical variables capture the continuous ones quite well. Similarly, the length of the balance sheet is well captured by the simple dummy denoting whether the balance sheet covers a full year or not.
pw.basefigure.clear()
p1 = (
ggplot(data, aes(x="foreign"))
+ geom_histogram(bins = 5, fill=color[0])
+ labs(title="Histogram of foreign management share", x="Foreign management (share)", y="Frequency")
+ theme_bw()
)
p2 = (
ggplot(data[data['origin'].notna()], aes(x="origin"))
+ geom_bar(fill=color[0])
+ labs(title="Histogram of management origin cat.", x="Management origin cat.", y="Frequency")
+ theme_bw()
)
display(pw.load_ggplot(p1) | pw.load_ggplot(p2))
pw.basefigure.clear()
p1 = (
ggplot(data, aes(x="female"))
+ geom_histogram(bins = 5, fill=color[0])
+ labs(title="Histogram of female management share", x="Female management (share)", y="Frequency")
+ theme_bw()
)
p2 = (
ggplot(data[data['gender'].notna()], aes(x="gender"))
+ geom_bar(fill=color[0])
+ labs(title="Histogram of management gender cat.", x="Management gender cat.", y="Frequency")
+ theme_bw()
)
display(pw.load_ggplot(p1) | pw.load_ggplot(p2))
(
ggplot(data[data['region_m'].notna()], aes(x="region_m"))
+ geom_bar(fill=color[0])
+ labs(title="Histogram of firm regions", x="Firm region", y="Frequency")
+ theme_bw()
)
(
ggplot(data[data['urban_m'].notna()], aes(x="urban_m"))
+ geom_bar(fill=color[0])
+ labs(title="Histogram of firm HQ urban location category", x="Firm HQ urban location category", y="Frequency")
+ theme_bw()
)
pw.basefigure.clear()
p1 = (
ggplot(data, aes(x="balsheet_length"))
+ geom_histogram(bins = 5, fill=color[0])
+ labs(title="Histogram of balance sheet length", x="Balance sheet length (days)", y="Frequency")
+ theme_bw()
)
p2 = (
ggplot(data, aes(x="balsheet_notfullyear"))
+ geom_bar(fill=color[0])
+ labs(title="Histogram of balance sheet not full year dummy", x="Balance sheet not full year (1 = not full year)", y="Frequency")
+ theme_bw()
)
display(pw.load_ggplot(p1) | pw.load_ggplot(p2))
Based on the above, we can adjust certain variables to categorical type. Note that here, I also create a dummy for foreign and female management, as well as a categorical outcome variable, which will not be used later (as decided during the analysis).
data["b_foreign_management"] = np.where(data["foreign"] >= 0.5, 1, np.where(data["foreign"].isna(), np.nan, 0))
data["b_female_management"] = np.where(data["female"] >= 0.5, 1, np.where(data["female"].isna(), np.nan, 0))
data["f_gender"] = data["gender"].astype("category")
data["f_region_m"] = data["region_m"].astype("category")
data["f_urban_m"] = data["urban_m"].astype("category")
data["f_ind2_cat"] = data["ind2_cat"].astype("category")
data['f_origin'] = data['origin'].astype('category')
data['f_ind'] = data['ind'].astype('category')
data["f_fast_growth"] = data["b_fast_growth"].astype("category")
data["f_fast_growth"] = data["f_fast_growth"].cat.rename_categories(["slow_growth", "fast_growth"])
data.rename(columns = {'balsheet_notfullyear' : 'flag_balsheet_notfullyear'}, inplace = True)
After the categorical variables, we can now turn our attention towards the most important asset variables: intangible assets, current assets, fixed assets and total assets. For these, I add a flag if they happen to be negative. Next, I follow the logic applied to sales before: I take logs the same way and add log differences for the two years before 2012. I also added simple level changes for the most important P&L variables, as I hypothesized that these may also be an important determinant of future growth rates.
data["flag_asset_problem"] = np.where(((data["intang_assets"] < 0) | (data["curr_assets"] < 0) | (data["fixed_assets"] < 0)), 1, 0)
data["flag_asset_problem"] = np.where(((data["intang_assets"].isna()) | (data["curr_assets"].isna()) | (data["fixed_assets"].isna())), np.nan, data["flag_asset_problem"])
data["intang_assets"] = np.where(data["intang_assets"] < 0, 1, data["intang_assets"])
data["curr_assets"] = np.where(data["curr_assets"] < 0, 1, data["curr_assets"])
data["fixed_assets"] = np.where(data["fixed_assets"] < 0, 1, data["fixed_assets"])
data["total_assets_bs"] = (data["intang_assets"] + data["curr_assets"] + data["fixed_assets"])
data = data.assign(
ln_intang_assets=np.where(
data["intang_assets"] > 0,
np.log(data["intang_assets"]),
(np.where(data["intang_assets"].isna(), np.nan, 0)),
),
ln_curr_assets=np.where(
data["curr_assets"] > 0,
np.log(data["curr_assets"]),
(np.where(data["curr_assets"].isna(), np.nan, 0)),
),
ln_fixed_assets=np.where(
data["fixed_assets"] > 0,
np.log(data["fixed_assets"]),
(np.where(data["fixed_assets"].isna(), np.nan, 0)),
),
ln_total_assets=np.where(
data["total_assets_bs"] > 0,
np.log(data["total_assets_bs"]),
(np.where(data["total_assets_bs"].isna(), np.nan, 0)),
),
)
asset_vars = ['ln_intang_assets', 'ln_curr_assets', 'ln_fixed_assets', 'ln_total_assets']
for v in asset_vars:
data[f'd1_{v}'] = data[v] - data.groupby("comp_id")[v].shift(1)
data[f'd2_{v}'] = data.groupby("comp_id")[v].shift(1) - data.groupby("comp_id")[v].shift(2)
data[f'd1_{v}'] = np.where(data["b_new"] == 1, 0, np.where(data["b_new"].isna(), np.nan, data[f'd1_{v}']))
data[f'd2_{v}'] = np.where(data["b_new"] == 1, 0, np.where(data["b_new"].isna(), np.nan, data[f'd2_{v}']))
data[f'd1_{v}'] = np.where(data[f'd1_{v}'].isna(), 0, data[f'd1_{v}'])
data[f'd2_{v}'] = np.where(data[f'd2_{v}'].isna(), 0, data[f'd2_{v}'])
pl_vars = ["inc_bef_tax", "inventories", "material_exp", "profit_loss_year", "personnel_exp",]
for v in pl_vars:
data[f'd1_{v}'] = data[v] - data.groupby("comp_id")[v].shift(1)
data[f'd2_{v}'] = data.groupby("comp_id")[v].shift(1) - data.groupby("comp_id")[v].shift(2)
data[f'd1_{v}'] = np.where(data["b_new"] == 1, 0, np.where(data["b_new"].isna(), np.nan, data[f'd1_{v}']))
data[f'd2_{v}'] = np.where(data["b_new"] == 1, 0, np.where(data["b_new"].isna(), np.nan, data[f'd2_{v}']))
data[f'd1_{v}'] = np.where(data[f'd1_{v}'].isna(), 0, data[f'd1_{v}'])
data[f'd2_{v}'] = np.where(data[f'd2_{v}'].isna(), 0, data[f'd2_{v}'])
The next thing to do is to create financial ratio variables. For this, I take key P&L and balance sheet variables and simply divide them by sales or total assets, respectively. Even though most of these ratios should fall either in a -1 to +1 or a 0 to +1 range, this is not always the case in this dataset. Thus, I floor and cap these variables at the bounds of the respective ranges, while adding appropriate flags to the data to account for these changes.
pl_names = ["extra_exp","extra_inc","extra_profit_loss","inc_bef_tax","inventories","material_exp","profit_loss_year","personnel_exp"]
bs_names = ["intang_assets","curr_liab","fixed_assets","liq_assets","curr_assets","share_eq","subscribed_cap","tang_assets"]
data[[col + "_pl" for col in pl_names]] = data[pl_names].div(data["sales"], axis="index")
data[[col + "_bs" for col in bs_names]] = (
data[bs_names]
.div(data["total_assets_bs"], axis="index")
.replace((np.inf, -np.inf, np.nan), (0, 0, 0))
)
for col in bs_names:
data[col + "_bs"] = np.where(
data["total_assets_bs"].isna(), np.nan, data[col + "_bs"]
)
zero = ["extra_exp_pl","extra_inc_pl","inventories_pl","material_exp_pl","personnel_exp_pl","curr_liab_bs","fixed_assets_bs","liq_assets_bs",
"curr_assets_bs","subscribed_cap_bs","intang_assets_bs"]
data[[col + "_flag_high" for col in zero]] = np.where(
data[zero].isna(), np.nan, (data[zero] > 1).astype(int)
)
data[[col for col in zero]] = np.where(
data[zero].isna(), np.nan, np.where(data[zero] > 1, 1, data[zero])
)
data[[col + "_flag_error" for col in zero]] = np.where(
data[zero].isna(), np.nan, (data[zero] < 0).astype(int)
)
data[[col for col in zero]] = np.where(
data[zero].isna(), np.nan, np.where(data[zero] < 0, 0, data[zero])
)
anyof = ["extra_profit_loss_pl", "inc_bef_tax_pl", "profit_loss_year_pl", "share_eq_bs"]
data[[col + "_flag_low" for col in anyof]] = np.where(
data[anyof].isna(), np.nan, (data[anyof] < -1).astype(int)
)
data[[col for col in anyof]] = np.where(
data[anyof].isna(), np.nan, np.where((data[anyof] < -1), -1, data[anyof])
)
data[[col + "_flag_high" for col in anyof]] = np.where(
data[anyof].isna(), np.nan, (data[anyof] > 1).astype(int)
)
data[[col for col in anyof]] = np.where(
data[anyof].isna(), np.nan, np.where((data[anyof] > 1), 1, data[anyof])
)
data[[col + "_flag_zero" for col in anyof]] = np.where(
data[anyof].isna(), np.nan, (data[anyof] == 0).astype(int)
)
Now that I had all the differenced variables I wanted, I filtered to a cross-section of firms alive in 2012. Importantly, I restricted my sample to firms with 2012 sales between 1,000 EUR and 10 million EUR, as I deemed smaller or larger companies inappropriate for my business case.
data = data[(data['year'] == 2012) & (data['status_alive'] == 1)]
data = data[(data['sales_mil'] <= 10) & (data['sales_mil'] >= 0.001)]
Next I create a variable with the calculated age of the CEO. As this has some extremely low and high values, I cap and floor the values above 75 and below 25 years (and also add a flag for the changes). Where there were missing values in CEO age, I imputed them with the mean. I also create a dummy variable for young CEOs.
data["ceo_age"] = data["year"] - data["birth_year"]
data = data.assign(
flag_low_ceo_age=(data["ceo_age"] < 25).astype(int),
flag_high_ceo_age=(data["ceo_age"] > 75).astype(int),
flag_miss_ceo_age=(data["ceo_age"].isna()).astype(int),
)
data["ceo_age"] = np.where(data["ceo_age"] < 25, 25, data["ceo_age"])
data["ceo_age"] = np.where(data["ceo_age"] > 75, 75, data["ceo_age"])
data["ceo_age"] = np.where(
data["ceo_age"].isna(), data["ceo_age"].mean(), data["ceo_age"]
)
data["b_ceo_young"] = (data["ceo_age"] < 40).astype(int)
(
ggplot(data, aes(x="ceo_age"))
+ geom_histogram(bins = 25, fill=color[0])
+ labs(title="Histogram of CEO age", x="CEO age", y="Frequency")
+ theme_bw()
)
There were a substantial number of missing values in the average number of employees. I imputed these with the sample mean and added a flag.
data["labor_avg_mod"] = np.where(data["labor_avg"].isna(), data["labor_avg"].mean(), data["labor_avg"])
data["flag_miss_labor_avg"] = (data["labor_avg"].isna()).astype(int)
data = data.drop(["labor_avg"], axis=1)
Next, I dropped observations which still had missing values in: liquid assets per total assets, foreign CEO share, industry code, firm age, material expenditures per sales, region of the firm.
data = data.dropna(subset=["liq_assets_bs", "foreign", "ind", "age", "material_exp_pl", "f_region_m"])
As CEO count had a pretty uneven distribution, I decided to include it as a categorical variable rather than a continuous one. I made bins for 1, 2 and more than 2 CEOs.
print(data.ceo_count.value_counts())
# pd.cut ignores the labels argument when bins is an IntervalIndex,
# so the interval categories are renamed to codes afterwards instead
f_ceo = pd.cut(data['ceo_count'],
               pd.IntervalIndex.from_tuples([(0, 2), (2, 3), (3, max(data.ceo_count) + 1)], closed="left"))
data['f_ceo_count'] = f_ceo.cat.rename_categories([0, 1, 2])
ceo_count
1.0     14297
2.0      4189
3.0       470
4.0        65
5.0        10
6.0         4
15.0        1
Name: count, dtype: int64
Before moving on to winsorization, I clean up my dataset a bit: I remove unused categories, drop flags with no variation, and drop columns that are no longer needed.
for col in data.select_dtypes(include=["category"]).columns:
data[col] = data[col].cat.remove_unused_categories()
flag_columns = [col for col in data.columns if "flag" in col]
data = data.drop(
data[flag_columns].std()[(data[flag_columns].std() == 0)].index, axis=1
)
data.drop(columns = ['begin', 'end', 'founded_year', 'founded_date', 'exit_year', 'exit_date', 'birth_year',
'cagr_2y', 'cagr_mean', 'cagr_std', 'status_alive', 'ind', 'ind2', 'ind2_cat', 'nace_main',
'origin', 'urban_m', 'region_m', 'ceo_count', 'foreign', 'female', 'gender', 'balsheet_length'], inplace = True)
Let's create the winsorized variables now. This is needed because most variables have a few extreme observations in the tails of their distributions, which could heavily influence simple logit or LASSO logit results. As I have 66 variables, I resort to the simpler approach: the winsorization frame is defined by the 5th and 95th percentiles. Unwinsorized LOWESS estimates are not included here as they took up a lot of space, but they validated the need for winsorization. I only winsorize variables for which I did not set upper and lower bounds before; for those with a minimum of 0, I only winsorize at the higher end.
num_cols = []
for c in data.columns:
if 'flag' not in c and c[0:2] != 'b_' and c[0:2] != 'f_' and c not in ['year', 'comp_id']:
num_cols.append(c)
print(f'Number of numeric columns: {len(num_cols)}')
hl_winsor = list(set(num_cols) - set(anyof) - set(['ceo_age']) - set(zero))
h_winsor = [c for c in hl_winsor if data[c].min() == 0]
hl_winsor = list(set(hl_winsor) - set(h_winsor))
toplot = []
for v in hl_winsor:
low = np.quantile(data[v], 0.05)
high = np.quantile(data[v], 0.95)
data[f'flag_low_{v}'] = np.where(data[v] < low, 1, 0)
data[f'flag_high_{v}'] = np.where(data[v] > high, 1, 0)
data[f'{v}_wins']= np.where(
data[v] < low,
low,
np.where(
data[v] > high,
high,
data[v])
)
toplot.append(f'{v}_wins')
for v in h_winsor:
high = np.quantile(data[v], 0.95)
data[f'flag_high_{v}'] = np.where(data[v] > high, 1, 0)
data[f'{v}_wins']= np.where(
data[v] > high,
high,
data[v])
toplot.append(f'{v}_wins')
toplot = toplot + zero + anyof + ['ceo_age']
toplot.sort()
Number of numeric columns: 66
Though not shown here (as they would take up an unnecessary amount of space), I have examined the LOWESS estimates between the binary outcome and all my quantitative variables, and I have seen numerous non-linear patterns. Thus, I have decided to include squared and cubed variables in my final dataset to account for these functional forms.
Those interested in the LOWESS plots, please refer directly to the source code.
for v in toplot:
data[f'{v}_sq'] = np.power(data[v], 2)
data[f'{v}_cu'] = np.power(data[v], 3)
I reorder my columns according to a pre-defined order, to ensure that the variable order is the same after each execution. This is important as tree-based models may produce different results with different variable orders.
The possibly different ordering across executions actually stems from the set() operations I use in my code, as these return their elements in a nondeterministic order across interpreter runs. For more detail, see e.g. this article.
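A minimal illustration of the point (the column names below are made up): the iteration order of a set of strings can differ between interpreter runs because string hashing is randomized by default, while sorting the result restores a deterministic order.

```python
# Set differences like those used above do not guarantee a stable
# element order across interpreter runs (string hashing is randomized).
cols = ['ln_sales_mil', 'ceo_age', 'ln_total_assets', 'age']
excluded = ['ceo_age']

remaining = list(set(cols) - set(excluded))           # order may vary between runs
remaining_sorted = sorted(set(cols) - set(excluded))  # deterministic order

print(remaining_sorted)  # ['age', 'ln_sales_mil', 'ln_total_assets']
```

Saving and reloading a fixed feature order (as done below) is an alternative way to pin down the ordering.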
url = "https://github.com/marton-nagy-marton/Data-Analysis-3/raw/refs/heads/main/Home_Assignment_2/feature_order.dill"
response = requests.get(url)
feature_order = dill.load(io.BytesIO(response.content))
data = data.filter(feature_order)
Lastly, we can export the resulting dataset for later use.
with open('HA2_workfile.dill', "wb") as f:
dill.dump(data, f)
2. Model building and analysis¶
Having the cleaned dataset in place with appropriately engineered features and a well-defined target variable, we can now turn our attention towards the actual model building and the analysis of these models. I import the prepped data from a saved .dill file (which is the same as the contents of the data variable in the above script).
with open('HA2_workfile.dill', 'rb') as f:
df = dill.load(f)
I defined some helper functions (mainly for creating uniform visualizations), based on the case study code and the utility functions of the textbook. At some points, I had to tweak the functions a bit so that my plots are more readable and better adjusted to my use case.
def seq(start: float, stop: float, by: float, round_n=3):
epsilon = np.finfo("float").eps
return [round(x, round_n) for x in list(np.arange(start, stop + (by - epsilon), by))]
def create_calibration_plot(data: pd.DataFrame, prob_var: str, actual_var: str,
y_lab='Actual probability of fast growth', x_lab='Predicted probability of fast growth', n_bins=10, breaks=None,):
if breaks is None:
breaks = np.around(
np.linspace(0, (n_bins + 1) / 10, num=n_bins + 1, endpoint=False),
decimals=1,
).tolist()
data["prob_bin"] = pd.cut(data[prob_var], breaks, right=True, include_lowest=True)
binned_data = (
data.groupby("prob_bin")
.agg(
mean_prob=(prob_var, "mean"),
mean_actual=(actual_var, "mean"),
n=(actual_var, "size"),
)
.reset_index()
)
return (
ggplot(binned_data, aes("mean_prob", "mean_actual"))
+ geom_segment(
x=min(breaks),
xend=max(breaks),
y=min(breaks),
yend=max(breaks),
color=color[1],
size=0.5,
)
+ geom_line(color=color[0], size=1.5, show_legend=True)
+ geom_point(color=color[0], size=2, show_legend=False, na_rm=True)
+ theme_bw()
+ labs(x=x_lab, y=y_lab)
+ coord_cartesian(xlim=(0, 1), ylim=(0, 1))
+ expand_limits(x=0.01, y=0.01)
+ scale_y_continuous(expand=(0.01, 0.01), breaks=(seq(0, 1.1, 0.1)))
+ scale_x_continuous(expand=(0.01, 0.01), breaks=(seq(0, 1.1, 0.1)))
)
def cv_summary(lambdas, C_values, model):
d = {
'lambdas': lambdas,
'C_values': C_values,
'mean_cv_score': model.scores_[1].mean(axis=0),
}
return pd.DataFrame(data=d)
def create_roc_plot(y_true, y_pred):
fpr, tpr, thresholds = roc_curve(y_true, y_pred)
all_coords = pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'thresholds': thresholds})
plot = (
ggplot(all_coords, aes(x='fpr', y='tpr'))
+ geom_line(color=color[0], size=0.7)
+ geom_area(position='identity', fill=color[1], alpha=0.3)
+ xlab('False positive rate (1 - specificity)')
+ ylab('True positive rate (sensitivity)')
+ geom_abline(intercept=0, slope=1, linetype='dotted', color='black')
+ scale_y_continuous(limits=(0, 1), breaks=seq(0, 1, 0.1), expand=(0, 0.01))
+ scale_x_continuous(limits=(0, 1), breaks=seq(0, 1, 0.1), expand=(0.01, 0))
+ theme_bw()
)
return plot
def create_loss_plot(all_coords, optimal_threshold, curr_exp_loss):
    # FP and FN are the global misclassification costs, defined later in the notebook
    all_coords_copy = all_coords.copy()
    all_coords_copy['loss'] = (
        all_coords_copy.false_pos * FP + all_coords_copy.false_neg * FN
    ) / all_coords_copy.n
t = optimal_threshold
l = curr_exp_loss
plot = (
ggplot(all_coords_copy, aes(x='thresholds', y='loss'))
+ geom_line(color=color[0], size=0.7)
+ scale_x_continuous(breaks=seq(0, 1.1, by=0.1))
+ coord_cartesian(xlim=(0, 1))
+ geom_vline(xintercept=t, color=color[1])
+ annotate(
geom='text',
x=t - 0.01,
y=max(all_coords_copy.loss)*0.7,
label='Best threshold: ' + str(round(t, 4)),
colour=color[1],
angle=90,
size=8,
)
+ annotate(geom='text', x=t + 0.1, y=l, label=f'Loss: {str(round(l, 4))}', size=8)
+ labs(x='Threshold', y='Loss')
+ theme_bw()
)
return plot
def create_roc_plot_with_optimal(all_coords, optimal_threshold):
all_coords_copy = all_coords.copy()
all_coords_copy['sp'] = 1 - all_coords_copy.true_neg / all_coords_copy.neg
all_coords_copy['se'] = all_coords_copy.true_pos / all_coords_copy.pos
best_coords = all_coords_copy[all_coords_copy.thresholds == optimal_threshold]
sp = best_coords.sp.values[0]
se = best_coords.se.values[0]
plot = (
ggplot(all_coords_copy, aes(x='sp', y='se'))
+ geom_line(color=color[0], size=0.7)
+ scale_y_continuous(breaks=seq(0, 1.1, by=0.1))
+ scale_x_continuous(breaks=seq(0, 1.1, by=0.1))
+ geom_point(data=pd.DataFrame({'sp': [sp], 'se': [se]}))
+ annotate(
geom='text',
x=sp - 0.06,
y=se + 0.06,
label= f'FP rate: {str(round(sp, 3))},\nTP rate: {str(round(se, 3))}',
size=7,
)
+ geom_area(position='identity', fill=color[1], alpha=0.3)
+ xlab('False positive rate (1 - specificity)')
+ ylab('True positive rate (sensitivity)')
+ geom_abline(intercept=0, slope=1, linetype='dotted', color='black')
+ theme_bw()
)
return plot
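To make the expected-loss logic inside `create_loss_plot` concrete, here is a toy calculation; the FP and FN cost values and the confusion counts below are purely illustrative, not the ones used in the analysis:

```python
# Hypothetical misclassification costs (the notebook defines the real
# FP and FN globals later; these numbers are illustrative only)
FP, FN = 1, 10
n = 1000  # number of observations

# toy confusion counts at two candidate classification thresholds
candidates = {
    0.2: {'false_pos': 150, 'false_neg': 20},
    0.5: {'false_pos': 40, 'false_neg': 80},
}

# expected loss per observation, mirroring the formula in create_loss_plot
losses = {t: (c['false_pos'] * FP + c['false_neg'] * FN) / n
          for t, c in candidates.items()}

best_threshold = min(losses, key=losses.get)
print(losses)          # {0.2: 0.35, 0.5: 0.84}
print(best_threshold)  # 0.2
```

With false negatives ten times as costly as false positives, the lower threshold wins despite producing many more false positives.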
As mentioned before, I ended up not using some of the variables created during data preparation. Instead of the dummies for foreign and female management, I opted for the corresponding categorical variables. Instead of the variables with sales in EUR, I opted for the sales in million EUR (note that this does not make any practical difference in my results, as the information content of the two is the same). Finally, instead of the categorical target variable, I simply used the binary one.
df.drop(columns=['b_foreign_management', 'b_female_management'], inplace=True)
df.drop(columns = [c for c in df.columns if 'sales' in c and 'sales_mil' not in c], inplace=True)
df.drop(columns=['f_fast_growth'], inplace=True)
The last thing to do before turning our attention towards modeling was to split the data into a training and a hold-out set. As my full sample had around 19 thousand observations, I decided on a random 80-to-20 split; thus my training set had around 15 thousand observations and the hold-out set around 4 thousand. As we can see, the proportion of fast-growth companies in the two sets is almost identical, so the split indeed looks random.
data_train, data_holdout = train_test_split(df, train_size=0.8, random_state=1234)
print('Share of fast-growth companies in...')
print(f'\n...full sample (N = {df.shape[0]}):')
print(df['b_fast_growth'].value_counts(normalize=True))
print(f'\n...training set (N = {data_train.shape[0]}):')
print(data_train['b_fast_growth'].value_counts(normalize=True))
print(f'\n...hold-out set (N = {data_holdout.shape[0]}):')
print(data_holdout['b_fast_growth'].value_counts(normalize=True))
Share of fast-growth companies in...

...full sample (N = 19036):
b_fast_growth
0    0.875184
1    0.124816
Name: proportion, dtype: float64

...training set (N = 15228):
b_fast_growth
0    0.875033
1    0.124967
Name: proportion, dtype: float64

...hold-out set (N = 3808):
b_fast_growth
0    0.875788
1    0.124212
Name: proportion, dtype: float64
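The near-identical class shares above come from random chance at this sample size. If exactly matching proportions were required, the `stratify` argument of `train_test_split` would enforce them. A minimal sketch on hypothetical toy data (the `toy` DataFrame below is not the real sample):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data mimicking the ~12.5% positive share of the real sample
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'b_fast_growth': rng.binomial(1, 0.125, size=1000),
    'x': rng.normal(size=1000),
})

# A stratified split enforces (nearly) identical class shares in both parts,
# instead of relying on random chance as the plain split does
tr, ho = train_test_split(toy, train_size=0.8, random_state=1234,
                          stratify=toy['b_fast_growth'])
print(round(tr['b_fast_growth'].mean(), 3), round(ho['b_fast_growth'].mean(), 3))
```

With stratification, the two shares can only differ by integer rounding of the class counts.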
2.1. Feature sets for different models¶
I have decided that the three models I wanted to experiment with were a LASSO logit, a probability forest and a probability XGBoost. Thus, I had to define the appropriate sets of input features for these models - one set for LASSO logit, and one other set for the tree-based models.
Below, I create 9 sets of variables with the following content:
- raw_flags: flags that were created during data cleaning (e.g. for imputations); does not include flags for winsorization;
- categories: categorical variables;
- binary: binary variables;
- raws: untransformed but cleaned variables (so no logs or winsorization, but may include imputed values and capped or floored variables);
- non_wins_diffs: differences of non-winsorized variables (either simple or log differences);
- numerics: all continuous variables (with winsorization and log transformation where applicable);
- squares: squared versions of the numerics set;
- cubes: cubed versions of the numerics set;
- wins_flags: flag variables created during winsorization.
# flags created during data cleaning
raw_flags = [c for c in data_train.columns if
'flag' in c
and not c.startswith('flag_low_')
and not c.startswith('flag_high_')] + ['flag_low_ceo_age', 'flag_high_ceo_age']
# categorical variables
categories = [c for c in data_train.columns if c.startswith('f_') and c != 'f_fast_growth']
# binary variables
binary = [c for c in data_train.columns if c.startswith('b_') and c != 'b_fast_growth']
# raw, untransformed variables
raws = [c for c in data_train.columns if
not c.startswith('ln_')
and 'flag' not in c
and not c.endswith('_wins')
and not c.startswith('d1_')
and not c.startswith('d2_')
and not c.startswith('f_')
and not c.startswith('b_')
and not c.endswith('_sq')
and not c.endswith('_cu')
and c not in ['year', 'comp_id']]
# diffs for non-winsorized variables
non_wins_diffs = [c for c in data_train.columns if
(c.startswith('d1_') or c.startswith('d2_'))
and not c.endswith('_wins')
and not c.endswith('_sq')
and not c.endswith('_cu')]
# temporary helper lists
squares = [c for c in data_train.columns if c.endswith('_sq')]
cubes = [c for c in data_train.columns if c.endswith('_cu')]
h = [c[:-3] for c in squares]
# numeric columns (winsorized if applicable)
numerics = [c for c in h if 'ln_' + c not in h]
# squares and cubes of numeric columns
squares = [c + '_sq' for c in numerics]
cubes = [c + '_cu' for c in numerics]
# temporary helper list
h2 = ['flag_high_' + c.replace('_wins', '') for c in numerics] + ['flag_low_' + c.replace('_wins', '') for c in numerics]
# flags created during winsorization
wins_flags = [c for c in h2 if c in data_train.columns]
I also define a very broad set of interaction terms for the LASSO logit model. I interact the 2-digit industry code with age, CEO age, origin and gender of the CEOs and the urban category of the firm. I also interact the above features with log sales, log total assets and profit/loss (and their two differences). Lastly, I interact the above continuous features (and their two differences) with each other.
interactions = [
'C(f_ind2_cat)*age_wins',
'C(f_ind2_cat)*age_wins_sq',
'C(f_ind2_cat)*age_wins_cu',
'C(f_ind2_cat)*ceo_age',
'C(f_ind2_cat)*ceo_age_sq',
'C(f_ind2_cat)*ceo_age_cu',
'C(f_ind2_cat)*C(f_origin)',
'C(f_ind2_cat)*C(f_gender)',
'C(f_ind2_cat)*C(f_urban_m)',
'C(f_ind2_cat)*labor_avg_mod',
'(ln_sales_mil_wins+d1_ln_sales_mil_wins+d2_ln_sales_mil_wins)'+\
'*(age_wins+age_wins_sq+age_wins_cu+ceo_age+ceo_age_sq+ceo_age_cu+'+\
'C(f_ind2_cat)+C(f_gender)+C(f_origin)+C(f_urban_m))',
'(ln_total_assets_wins+d1_ln_total_assets_wins+d2_ln_total_assets_wins)'+\
'*(age_wins+age_wins_sq+age_wins_cu+ceo_age+ceo_age_sq+ceo_age_cu+'+\
'C(f_ind2_cat)+C(f_gender)+C(f_origin)+C(f_urban_m))',
'(profit_loss_year_wins+d1_profit_loss_year_wins+d2_profit_loss_year_wins)'+\
'*(age_wins+age_wins_sq+age_wins_cu+ceo_age+ceo_age_sq+ceo_age_cu+'+\
'C(f_ind2_cat)+C(f_gender)+C(f_origin)+C(f_urban_m))',
'ln_sales_mil_wins*ln_total_assets_wins',
'ln_sales_mil_wins*profit_loss_year_wins',
'ln_total_assets_wins*profit_loss_year_wins',
'd1_ln_sales_mil_wins*d1_ln_total_assets_wins',
'd1_ln_sales_mil_wins*d1_profit_loss_year_wins',
'd1_ln_total_assets_wins*d1_profit_loss_year_wins',
'd2_ln_sales_mil_wins*d2_ln_total_assets_wins',
'd2_ln_sales_mil_wins*d2_profit_loss_year_wins',
'd2_ln_total_assets_wins*d2_profit_loss_year_wins',
]
The input features for the LASSO logit are practically all features created during data preparation (using only the log-transformed and winsorized versions where applicable, together with the corresponding flags). For the tree-based models, the input features consisted of the raw variables only (that is, without logs and without winsorization), plus the appropriate flags created during data preparation.
lasso_vars = numerics + squares + cubes + raw_flags + wins_flags + categories + binary + interactions
tree_vars = raws + non_wins_diffs + raw_flags + categories + binary
Having defined the variable sets, I could now create the appropriate design matrices for each model set-up. By encoding categorical features as a set of binary variables (dropping one category of course for each), the LASSO logit had 602 input features, while the tree based models had 122.
I also defined the set of lambda parameters (with corresponding C-values) for LASSO logit here, based on the case study. These turned out to be appropriate in this case, too.
lasso_model_equation = 'b_fast_growth~' + '+'.join(lasso_vars)
y_train_lasso, X_train_lasso = patsy.dmatrices(lasso_model_equation, data_train)
lasso_scaler = StandardScaler().fit(X_train_lasso)
X_train_lasso_scaled = pd.DataFrame(lasso_scaler.transform(X_train_lasso), columns=X_train_lasso.design_info.column_names)
lambdas = list(10 ** np.arange(-1, -4.01, -1 / 3))
n_obs = X_train_lasso_scaled.shape[0] * 4 / 5
C_values = [1 / (l * n_obs) for l in lambdas]
tree_model_equation = 'b_fast_growth~' + '+'.join(tree_vars)
y_train_tree = pd.DataFrame(data_train['b_fast_growth'])
X_train_tree = pd.get_dummies(data_train[tree_vars], drop_first=False)
X_train_tree.columns = [col.replace("[", "(").replace("]", ")").replace("<", "") for col in X_train_tree.columns]
for c in [c for c in X_train_tree.columns if c.startswith('f_')]:
X_train_tree[c] = X_train_tree[c].astype(int)
print(f'Number of input features for LASSO logit (categoricals encoded as binary, one category dropped for each): {X_train_lasso_scaled.shape[1]}')
print(f'Number of input features for tree-based models (categoricals one-hot-encoded as binary): {X_train_tree.shape[1]}')
Number of input features for LASSO logit (categoricals encoded as binary, one category dropped for each): 602
Number of input features for tree-based models (categoricals one-hot-encoded as binary): 122
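For reference, the lambda-to-C conversion above reflects sklearn's parameterization: `C` is the inverse of the aggregate l1 penalty strength, while the case-study lambdas penalize per observation, so C = 1 / (lambda · n) with n the size of one CV training fold. A standalone sketch (the fold size is hard-coded here as 4/5 of the ~15k-row training set, an assumption restated from the split above):

```python
import numpy as np

# sklearn's C is the inverse of the aggregate l1 penalty, while the case-study
# lambdas penalize per observation, hence C = 1 / (lambda * n_fold)
lambdas = 10 ** np.arange(-1, -4.01, -1 / 3)  # log-spaced grid from 1e-1 down to 1e-4
n_fold = 15228 * 4 / 5                        # rows in one training fold of the 5-fold CV
C_values = 1 / (lambdas * n_fold)
print(len(lambdas))  # 10 grid points
```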
Before building the actual models, I defined a basic 5-fold cross-validator that I will use throughout the model building.
k = KFold(n_splits=5, shuffle=True, random_state=42)
2.2. Model building¶
Note: The code below was executed only once, and the trained models were saved to .dill files. Therefore, the following code snippets are not executed here - they are only included to explain how the training took place. I will import the trained models later on to perform the required analytics on them.
As I wanted to measure model training times, I created a dictionary where I logged the results.
model_times = {
'lasso_logit' : -9,
'prob_forest' : -9,
'prob_xgb' : -9
}
To train the LASSO logit models, I cross-validated them over the C-values defined above. As LogisticRegressionCV cannot evaluate two scoring metrics at once - but I wanted both the negative Brier score and AUC -, I had to fit the model twice, once per metric. I used the saga solver, as it turned out to be the fastest option.
Note that for all models, I set the refit parameter to true - this means that the best model found through cross-validation was automatically refitted on the whole training set, thus the results can directly be used for prediction.
lasso_model_brier = LogisticRegressionCV(
Cs=C_values,
penalty='l1',
cv=k,
refit=True,
scoring='neg_brier_score',
solver='saga',
random_state=42,
n_jobs=-1
)
lasso_model_auc = LogisticRegressionCV(
Cs=C_values,
penalty='l1',
cv=k,
refit=True,
scoring='roc_auc',
solver='saga',
random_state=42,
n_jobs=-1
)
# The training time is not the sum of training with the two metrics - it is just a technical
# detail that LogisticRegressionCV cannot estimate two scoring metrics at once.
start_time = time()
lasso_brier = lasso_model_brier.fit(X_train_lasso_scaled, y_train_lasso)
timespan = time() - start_time
model_times['lasso_logit'] = timespan
lasso_auc = lasso_model_auc.fit(X_train_lasso_scaled, y_train_lasso)
I export the results for later use.
if not os.path.exists('model_dills'):
os.makedirs('model_dills')
with open("model_times.json", "w") as f:
json.dump(model_times, f)
with open(Path("model_dills", "lasso_brier.dill"), "wb") as f:
dill.dump(lasso_brier, f)
with open(Path("model_dills", "lasso_auc.dill"), "wb") as f:
dill.dump(lasso_auc, f)
For the probability forest model, I defined a 4x4 tuning grid based on prior experience. The split criterion for growing the trees was set to the Gini impurity score, as instructed by the case study. During the cross-validated grid search, I calculated both AUC and negative Brier score metrics, but the best model for refitting on the whole training set was selected by AUC, as the end goal of the exercise was prediction.
As the number of input features is 122, the theoretically optimal number of maximum features is the square root of this, that is around 11, so I built my tuning grid around this intuition.
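The sqrt(p) rule of thumb referenced above, spelled out (p = 122 from the tree-based design matrix):

```python
import math

# The sqrt(p) rule of thumb for max_features in classification forests
p = 122                 # number of input features for the tree-based models
print(math.isqrt(p))    # 11, since 11**2 = 121 <= 122 < 12**2 = 144
```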
grid = {
"max_features": [9, 11, 13, 15],
"criterion": ["gini"],
"min_samples_split": [10, 15, 20, 25],
}
prob_forest = RandomForestClassifier(
random_state=42,
n_estimators=500,
oob_score=True,
n_jobs=-1
)
prob_forest_grid = GridSearchCV(
prob_forest,
grid,
cv=k,
refit="roc_auc",
scoring=["roc_auc", "neg_brier_score"],
verbose=0,
n_jobs=-1
)
start_time = time()
prob_forest_fit = prob_forest_grid.fit(X_train_tree, y_train_tree)
timespan = time() - start_time
model_times['prob_forest'] = timespan
Again, I export the results.
with open("model_times.json", "w") as f:
json.dump(model_times, f)
with open(Path("model_dills", "prob_forest.dill"), "wb") as f:
dill.dump(prob_forest_fit, f)
Training the probability XGBoost model was the most cumbersome. As XGBoost has quite a high number of tunable hyperparameters, I decided to tune them in two steps. First, I defined a larger tuning grid, which was evaluated by HalvingGridSearchCV with only 100 boosting rounds and 3-fold cross-validation. HalvingGridSearchCV starts cross-validation with only a fraction of the data and eliminates the worst-performing hyperparameter combinations in each iteration; the smaller and smaller candidate grids are then evaluated on more and more data. This approach speeds up training significantly.
Note that in HalvingGridSearchCV, as the resource is n_samples by default and the CV splitter uses no shuffling, there is no randomness involved, so the results are reproducible (as per the docs).
Second, I looked at the best hyperparameters found by this approach, and I defined a smaller tuning grid based on this. This grid was then evaluated by regular grid search with 5-fold cross-validation and 500 boosting rounds.
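To illustrate the mechanism, here is a sketch of how successive halving with factor=3 would prune the 1296-candidate grid defined below (4 · 4 · 3 · 3 · 3 · 3 combinations). The actual schedule also depends on min_resources and the available sample size, so this only approximates the pruning sequence:

```python
import math

# Successive halving keeps roughly 1/factor of the candidates per iteration,
# while tripling the number of samples each surviving candidate is trained on
factor = 3
n_candidates = 4 * 4 * 3 * 3 * 3 * 3  # = 1296, the size of the first XGBoost grid
schedule = [n_candidates]
while n_candidates > 1:
    n_candidates = math.ceil(n_candidates / factor)
    schedule.append(n_candidates)
print(schedule)  # [1296, 432, 144, 48, 16, 6, 2, 1]
```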
Note that XGBClassifier does not natively support the Gini impurity score as an evaluation metric, so the models were evaluated based on RMSE instead - which should theoretically yield the same results as the Gini impurity score.
grid_xgb = {
"max_depth": [4, 6, 8, 10],
"colsample_bytree": [0.2, 0.3, 0.5, 0.7],
"learning_rate": [0.03, 0.05, 0.1],
"min_child_weight": [5, 10, 15],
"subsample": [0.2, 0.6, 1],
"gamma": [0, 0.1, 0.3],
}
prob_xgb = XGBClassifier(
tree_method='hist',
objective='binary:logistic',
eval_metric='rmse', #Gini is not available in XGBoost, but RMSE should yield the same results
n_estimators=100,
random_state=42,
n_jobs=-1,
verbose=0
)
prob_xgb_grid = HalvingGridSearchCV(
prob_xgb,
grid_xgb,
cv=3,
factor=3,
refit=False,
scoring='roc_auc',
verbose=0,
n_jobs=-1
)
start_time = time()
prob_xgb_fit = prob_xgb_grid.fit(X_train_tree, y_train_tree)
timespan = time() - start_time
model_times['prob_xgb'] = timespan
The best hyperparameters found by the HalvingGridSearchCV are the following:
{'colsample_bytree': 0.2,
'gamma': 0.1,
'learning_rate': 0.03,
'max_depth': 10,
'min_child_weight': 5,
'subsample': 0.6}
grid_xgb = {
"max_depth": [10, 12],
"colsample_bytree": [0.1, 0.2],
"learning_rate": [0.02, 0.03],
"min_child_weight": [3, 5],
"subsample": [0.6, 0.8],
"gamma": [0, 0.1],
}
prob_xgb = XGBClassifier(
tree_method='hist',
objective='binary:logistic',
eval_metric='rmse', #Gini is not available in XGBoost, but RMSE should yield the same results
n_estimators=500, # increased number of boosting rounds
random_state=42,
n_jobs=-1,
verbose=0
)
prob_xgb_grid = GridSearchCV(
prob_xgb,
grid_xgb,
cv=k, # using the same 5-fold cross-validation as for the other models
refit='roc_auc', # refitting the model with the best parameters for ROC AUC
scoring=['roc_auc', 'neg_brier_score'], # calculating both needed metrics
verbose=0,
n_jobs=-1
)
start_time = time()
prob_xgb_fit = prob_xgb_grid.fit(X_train_tree, y_train_tree)
timespan += time() - start_time # adding the time for the second grid search
model_times['prob_xgb'] = timespan
I export the XGBoost results as well.
with open("model_times.json", "w") as f:
json.dump(model_times, f)
with open(Path("model_dills", "prob_xgb.dill"), "wb") as f:
dill.dump(prob_xgb_fit, f)
2.3. Model performance based on cross-validated results¶
To evaluate the models based on their cross-validated performance, I import the trained models saved previously.
with open("model_times.json", "r", encoding="utf-8") as f:
model_times = json.load(f)
with open(Path("model_dills", "lasso_brier.dill"), "rb") as f:
lasso_brier = dill.load(f)
with open(Path("model_dills", "lasso_auc.dill"), "rb") as f:
lasso_auc = dill.load(f)
with open(Path("model_dills", "prob_forest.dill"), "rb") as f:
prob_forest = dill.load(f)
with open(Path("model_dills", "prob_xgb.dill"), "rb") as f:
prob_xgb = dill.load(f)
First, I will look at how the different models performed across their hyperparameters. As for the LASSO logit model, we can see that cross-validation could find a locally optimal lambda (C-value). Moreover, the two scoring metrics - RMSE and AUC - agree on which lambda is best. The concrete RMSE and AUC values will be presented later.
As mentioned before, the LASSO logit model had 602 input features in total, of which 180 were selected by the algorithm (and thus 422 coefficients were shrunk to zero).
cv_summary_lasso = cv_summary(lambdas, C_values, lasso_brier)
cv_summary_lasso['mean_cv_score'] = np.sqrt(cv_summary_lasso['mean_cv_score'] * -1)
cv_lasso_auc = cv_summary(lambdas, C_values, lasso_auc)
cv_lasso_auc.rename(columns={'mean_cv_score': 'mean_cv_auc'}, inplace=True)
cv_summary_lasso = cv_summary_lasso.merge(cv_lasso_auc, on=['lambdas', 'C_values'])
cv_summary_lasso.columns = ['Lambda', 'C value', 'Mean CV RMSE', 'Mean CV AUC']
pw.basefigure.clear()
p1 = (
ggplot(cv_summary_lasso, aes(x="Lambda", y="Mean CV RMSE"))
+ geom_point(fill = color[0], color = color[1], size=4)
+ scale_y_continuous(limits=(0.31, 0.34), breaks=seq(0.31, 0.34, 0.01))
+ labs(title = 'LASSO logit - CV mean test RMSE by lambda')
+ theme_bw()
)
p2 = (
ggplot(cv_summary_lasso, aes(x="Lambda", y="Mean CV AUC"))
+ geom_point(fill = color[0], color = color[1], size=4)
+ scale_y_continuous(limits=(0.5, 0.75), breaks=seq(0.5, 0.75, 0.05))
+ labs(title = 'LASSO logit - CV mean test AUC by lambda')
+ theme_bw()
)
f = pw.load_ggplot(p1) | pw.load_ggplot(p2)
display(f)
print(f'Number of input features for LASSO logit: {X_train_lasso_scaled.shape[1]}')
print(f'Number of selected features by LASSO logit: {np.sum(lasso_auc.coef_ != 0)}')
print(f'Number of coefficients shrunk to zero: {np.sum(lasso_auc.coef_ == 0)}')
Number of input features for LASSO logit: 602
Number of selected features by LASSO logit: 180
Number of coefficients shrunk to zero: 422
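A note on the RMSE values reported throughout: cross-validated RMSE is computed as the square root of the negated neg_brier_score. This works because, for a binary target, the Brier score is exactly the mean squared error of the predicted probabilities. A quick check with made-up toy values:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, mean_squared_error

# Toy values (made up for illustration): Brier score == MSE of predicted probabilities
y_true = np.array([0, 1, 1, 0, 1])
p_hat = np.array([0.1, 0.8, 0.6, 0.3, 0.4])
brier = brier_score_loss(y_true, p_hat)
print(np.isclose(brier, mean_squared_error(y_true, p_hat)))  # True
print(round(float(np.sqrt(brier)), 4))  # the corresponding RMSE
```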
As we can see, the best hyperparameter combination chosen by RMSE and AUC is exactly the same for the probability forest. Also note that while the probability forest slightly outperforms the LASSO logit in terms of RMSE, it actually scores lower in terms of AUC.
Also, we can observe that the optimal hyperparameter set by AUC falls in the middle of the tuning grid. This suggests that we found a locally optimal value in the hyperparameter space.
Note: on both heatmaps, dark purple colors indicate better results (lower RMSE or higher AUC).
heatmap_data_rmse = pd.DataFrame(prob_forest.cv_results_)[
["param_max_features", "param_min_samples_split", "mean_test_neg_brier_score"]
].assign(
mean_test_score=lambda x: np.sqrt(x["mean_test_neg_brier_score"] * -1),
Variables=lambda x: x["param_max_features"],
Min_nodes=lambda x: x["param_min_samples_split"],
).pivot(
index="Min_nodes", columns="Variables", values="mean_test_score"
).round(5)
heatmap_data_auc = pd.DataFrame(prob_forest.cv_results_)[
["param_max_features", "param_min_samples_split", "mean_test_roc_auc"]
].assign(
mean_test_score=lambda x: x["mean_test_roc_auc"],
Variables=lambda x: x["param_max_features"],
Min_nodes=lambda x: x["param_min_samples_split"],
).pivot(
index="Min_nodes", columns="Variables", values="mean_test_score"
).round(5)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(heatmap_data_rmse, annot=True, cmap="inferno", fmt=".5f", linewidths=0.5, ax=axes[0])
axes[0].set_xlabel("Max. features")
axes[0].set_ylabel("Min. samples split")
axes[0].set_title("Probability forest tuning grid results (CV RMSE)")
sns.heatmap(heatmap_data_auc, annot=True, cmap="inferno_r", fmt=".5f", linewidths=0.5, ax=axes[1])
axes[1].set_xlabel("Max. features")
axes[1].set_ylabel("Min. samples split")
axes[1].set_title("Probability forest tuning grid results (CV AUC)")
plt.tight_layout()
plt.show()
For the probability XGBoost, we can see a discrepancy between the best models by RMSE and AUC. Nonetheless, the best model by AUC is still among the best by RMSE (to be precise, the 6th best). As our end-goal is classification, I will rely on the AUC values. Also, note that XGBoost means a notable improvement compared to the LASSO logit and the probability forest, both in terms of RMSE and AUC.
xgb_grid_results = pd.DataFrame(prob_xgb.cv_results_)[[
"mean_test_neg_brier_score",
"mean_test_roc_auc",
"param_colsample_bytree",
"param_learning_rate",
"param_max_depth",
"param_min_child_weight",
"param_subsample",
"param_gamma"
]].astype(float)
xgb_grid_results.mean_test_neg_brier_score = np.sqrt(xgb_grid_results.mean_test_neg_brier_score * -1)
xgb_grid_results.sort_values(by = 'mean_test_roc_auc', ascending = False, inplace = True)
xgb_grid_results.columns = ['CV RMSE', 'CV AUC', 'Col. subsample ratio', 'Learning rate', 'Max. depth', 'Min. child weight', 'Subsampling ratio', 'Gamma']
xgb_grid_results.reset_index(drop = True, inplace = True)
print('Best performing 5 hyperparameter combinations for XGBoost (by AUC):')
display(xgb_grid_results.head(5))
print('\nWorst performing 5 hyperparameter combinations for XGBoost (by AUC):')
display(xgb_grid_results.tail(5))
Best performing 5 hyperparameter combinations for XGBoost (by AUC):
| | CV RMSE | CV AUC | Col. subsample ratio | Learning rate | Max. depth | Min. child weight | Subsampling ratio | Gamma |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.309449 | 0.764710 | 0.2 | 0.02 | 10.0 | 3.0 | 0.6 | 0.0 |
| 1 | 0.308874 | 0.764518 | 0.2 | 0.02 | 10.0 | 5.0 | 0.8 | 0.1 |
| 2 | 0.311077 | 0.764476 | 0.2 | 0.02 | 12.0 | 3.0 | 0.8 | 0.1 |
| 3 | 0.309404 | 0.764417 | 0.2 | 0.02 | 12.0 | 5.0 | 0.8 | 0.0 |
| 4 | 0.309363 | 0.764306 | 0.2 | 0.02 | 10.0 | 3.0 | 0.6 | 0.1 |
Worst performing 5 hyperparameter combinations for XGBoost (by AUC):
| | CV RMSE | CV AUC | Col. subsample ratio | Learning rate | Max. depth | Min. child weight | Subsampling ratio | Gamma |
|---|---|---|---|---|---|---|---|---|
| 59 | 0.311351 | 0.753257 | 0.1 | 0.02 | 12.0 | 3.0 | 0.6 | 0.0 |
| 60 | 0.312124 | 0.753148 | 0.1 | 0.03 | 12.0 | 5.0 | 0.8 | 0.0 |
| 61 | 0.311274 | 0.753060 | 0.1 | 0.03 | 12.0 | 5.0 | 0.6 | 0.0 |
| 62 | 0.312859 | 0.752495 | 0.1 | 0.03 | 12.0 | 3.0 | 0.6 | 0.0 |
| 63 | 0.311865 | 0.750588 | 0.1 | 0.03 | 12.0 | 5.0 | 0.6 | 0.1 |
Now let's compare the cross-validated performance metrics for the three models. As AUC and RMSE basically tell the same story, I will concentrate on AUC only. The LASSO logit and the probability forest models are practically indistinguishable in terms of their AUC score. The probability XGBoost, however, slightly outperforms both. This gain comes at the cost of substantially higher training time: training the XGBoost model took roughly 10 times longer than the LASSO logit, and about 1.4 times longer than the probability forest.
All in all, the cross-validated performance metrics suggest that if our goal was simply to predict probabilities, then we should choose the probability XGBoost model. However, if for some regulatory reasons we are not allowed to use black-box models, then the LASSO logit is already a good enough approach.
cv_all_results = pd.DataFrame({
'LASSO logit' : {
'CV RMSE' : round(cv_summary_lasso['Mean CV RMSE'].min(), 4),
'CV AUC' : round(cv_summary_lasso['Mean CV AUC'].max(), 4),
'Time (min)' : round(model_times['lasso_logit'] / 60, 2)
},
'Prob. forest' : {
'CV RMSE' : round(np.sqrt(prob_forest.cv_results_['mean_test_neg_brier_score'][prob_forest.cv_results_['mean_test_roc_auc'].argmax()]*-1), 4),
'CV AUC' : round(prob_forest.cv_results_['mean_test_roc_auc'].max(), 4),
'Time (min)' : round(model_times['prob_forest'] / 60, 2)
},
'Prob. XGB' : {
'CV RMSE' : round(np.sqrt(prob_xgb.cv_results_['mean_test_neg_brier_score'][prob_xgb.cv_results_['mean_test_roc_auc'].argmax()]*-1), 4),
'CV AUC' : round(prob_xgb.cv_results_['mean_test_roc_auc'].max(), 4),
'Time (min)' : round(model_times['prob_xgb'] / 60, 2)
}
}).T
cv_all_results
| | CV RMSE | CV AUC | Time (min) |
|---|---|---|---|
| LASSO logit | 0.3121 | 0.7483 | 2.42 |
| Prob. forest | 0.3111 | 0.7468 | 17.93 |
| Prob. XGB | 0.3094 | 0.7647 | 24.50 |
2.4. Model performance on hold-out set¶
Looking at the hold-out set results, we can see only an insignificant change in performance for all three models in both metrics. This suggests that the models are properly tuned and do not overfit the training data - though there may be some room for improvement. We can also see that on the hold-out set, it is still the probability XGBoost that performs best, both in terms of AUC and RMSE. Thus we have strong reasons to believe that if our goal were simple probability prediction, XGBoost would be the most appropriate model.
y_hold_lasso, X_hold_lasso = patsy.dmatrices(lasso_model_equation, data_holdout)
X_hold_lasso_scaled = pd.DataFrame(lasso_scaler.transform(X_hold_lasso), columns=X_hold_lasso.design_info.column_names)
y_hold_tree = pd.DataFrame(data_holdout['b_fast_growth'])
X_hold_tree = pd.get_dummies(data_holdout[tree_vars], drop_first=False)
X_hold_tree.columns = [col.replace("[", "(").replace("]", ")").replace("<", "") for col in X_hold_tree.columns]
for c in [c for c in X_hold_tree.columns if c.startswith('f_')]:
X_hold_tree[c] = X_hold_tree[c].astype(int)
hold_out_predictions = pd.DataFrame({
'b_fast_growth_true' : y_hold_lasso.ravel(),
'lasso_pred' : lasso_auc.predict_proba(X_hold_lasso_scaled)[:, 1],
'rf_pred' : prob_forest.predict_proba(X_hold_tree)[:, 1],
'xgb_pred' : prob_xgb.predict_proba(X_hold_tree)[:, 1]
})
hold_out_metrics = pd.DataFrame({
'LASSO logit' : {
'Hold-out RMSE' : mean_squared_error(hold_out_predictions['b_fast_growth_true'], hold_out_predictions['lasso_pred'], squared=False).round(4),
'Hold-out AUC' : roc_auc_score(hold_out_predictions['b_fast_growth_true'], hold_out_predictions['lasso_pred']).round(4)
},
'Prob. forest' : {
'Hold-out RMSE' : mean_squared_error(hold_out_predictions['b_fast_growth_true'], hold_out_predictions['rf_pred'], squared=False).round(4),
'Hold-out AUC' : roc_auc_score(hold_out_predictions['b_fast_growth_true'], hold_out_predictions['rf_pred']).round(4)
},
'Prob. XGB' : {
'Hold-out RMSE' : mean_squared_error(hold_out_predictions['b_fast_growth_true'], hold_out_predictions['xgb_pred'], squared=False).round(4),
'Hold-out AUC' : roc_auc_score(hold_out_predictions['b_fast_growth_true'], hold_out_predictions['xgb_pred']).round(4)
}
}).T
prob_summary = cv_all_results.merge(hold_out_metrics, left_index=True, right_index=True)
prob_summary = prob_summary.filter(['CV RMSE', 'Hold-out RMSE', 'CV AUC', 'Hold-out AUC', 'Time (min)'])
prob_summary
| | CV RMSE | Hold-out RMSE | CV AUC | Hold-out AUC | Time (min) |
|---|---|---|---|---|---|
| LASSO logit | 0.3121 | 0.3118 | 0.7483 | 0.7444 | 2.42 |
| Prob. forest | 0.3111 | 0.3097 | 0.7468 | 0.7456 | 17.93 |
| Prob. XGB | 0.3094 | 0.3079 | 0.7647 | 0.7555 | 24.50 |
We can also visualize the AUC values on the hold-out set with the corresponding ROC curves. As the AUC values do not differ much in magnitude, the ROC curves look more or less the same.
pw.basefigure.clear()
p1 = create_roc_plot(hold_out_predictions['b_fast_growth_true'], hold_out_predictions['lasso_pred']) + labs(title='LASSO logit')
p2 = create_roc_plot(hold_out_predictions['b_fast_growth_true'], hold_out_predictions['rf_pred']) + labs(title='Probability forest')
p3 = create_roc_plot(hold_out_predictions['b_fast_growth_true'], hold_out_predictions['xgb_pred']) + labs(title='Probability XGBoost')
f = pw.load_ggplot(p1) | pw.load_ggplot(p2) | pw.load_ggplot(p3)
display(f)
A more interesting diagnostic plot to look at is the calibration plot. It shows, for each 0.1-unit-wide bin of predicted probability, what share of the observations actually fall into the fast-growth category. From the plots below, we can see that both the LASSO logit and the probability forest models are rather badly calibrated: the former overpredicts relatively high probabilities, while the latter usually underpredicts them. The best-calibrated model seems to be the probability XGBoost, as it stays very close to the diagonal even for high probabilities.
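The create_calibration_plot helper is defined in the source code; the binning logic behind such a plot can be sketched as follows (a hypothetical calibration_table helper, demonstrated on simulated, perfectly calibrated predictions):

```python
import numpy as np
import pandas as pd

# Hypothetical helper: bin predictions into 0.1-wide buckets and compare the
# mean prediction with the observed event rate in each bin
def calibration_table(pred, actual, n_bins=10):
    df = pd.DataFrame({'pred': pred, 'actual': actual})
    df['bin'] = pd.cut(df['pred'], bins=np.linspace(0, 1, n_bins + 1),
                       include_lowest=True)
    return df.groupby('bin', observed=True).agg(
        mean_pred=('pred', 'mean'),       # x-coordinate of a calibration point
        actual_share=('actual', 'mean'),  # y-coordinate: observed share of events
        n=('actual', 'size'),             # bin size - sparse bins are noisy
    )

# Simulated, perfectly calibrated predictions: points should hug the diagonal
rng = np.random.default_rng(42)
p = rng.uniform(0, 1, 5000)
y = rng.binomial(1, p)
print(calibration_table(p, y))
```

On real predictions, systematic gaps between mean_pred and actual_share reveal over- or underprediction in the corresponding probability range.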
pw.basefigure.clear()
p1 = create_calibration_plot(
hold_out_predictions,
prob_var="lasso_pred",
actual_var="b_fast_growth_true",
n_bins=10,
breaks=None,
) + labs(title="LASSO logit")
p2 = create_calibration_plot(
hold_out_predictions,
prob_var="rf_pred",
actual_var="b_fast_growth_true",
n_bins=10,
breaks=None,
) + labs(title="Probability forest")
p3 = create_calibration_plot(
hold_out_predictions,
prob_var="xgb_pred",
actual_var="b_fast_growth_true",
n_bins=10,
breaks=None,
) + labs(title="Probability XGBoost")
f = pw.load_ggplot(p1) | pw.load_ggplot(p2) | pw.load_ggplot(p3)
display(f)
To examine to what extent miscalibration for higher predicted probabilities might be an issue, we can take a look at predicted probability distributions conditional on the actual value of the target variable. First, note that for all three models, very high predicted probabilities are quite rare - this means that on the above calibration plots, the higher bins have relatively few observations, which is a natural source of deviations from the 45-degree line. Second, we can see that there is a substantial overlap between the two density plots, indicating that the models struggle to separate the two classes based on predicted probabilities. This overlap suggests that while the models do assign higher probabilities to instances of fast growth compared to no fast growth, they do not do so with a high degree of confidence.
hold_out_predictions['outcome_cat'] = hold_out_predictions['b_fast_growth_true'].apply(lambda x: 'Fast growth' if x == 1 else 'No fast growth')
pw.basefigure.clear()
p1 = (
ggplot(hold_out_predictions, aes(x="lasso_pred", fill='outcome_cat'))
+ geom_density(alpha=0.3)
+ theme_bw()
+ labs(title="LASSO logit",
x="Predicted probability",
y="Density",
fill="True category")
+ scale_fill_manual(values=color)
+ scale_x_continuous(limits=(0, 1), breaks=seq(0, 1.1, 0.1))
+ scale_y_continuous(limits=(0, 12), breaks=seq(0, 12, 2))
+ theme(legend_position=(0.90, 0.90), legend_direction='vertical')
)
p2 = (
ggplot(hold_out_predictions, aes(x="rf_pred", fill='outcome_cat'))
+ geom_density(alpha=0.3)
+ theme_bw()
+ labs(title="Probability forest",
x="Predicted probability",
y="Density",
fill="True category")
+ scale_fill_manual(values=color)
+ scale_x_continuous(limits=(0, 1), breaks=seq(0, 1.1, 0.1))
+ scale_y_continuous(limits=(0, 12), breaks=seq(0, 12, 2))
+ theme(legend_position=(0.90, 0.90))
)
p3 = (
ggplot(hold_out_predictions, aes(x="xgb_pred", fill='outcome_cat'))
+ geom_density(alpha=0.3)
+ theme_bw()
+ labs(title="Probability XGBoost",
x="Predicted probability",
y="Density",
fill="True category")
+ scale_fill_manual(values=color)
+ scale_x_continuous(limits=(0, 1), breaks=seq(0, 1.1, 0.1))
+ scale_y_continuous(limits=(0, 12), breaks=seq(0, 12, 2))
+ theme(legend_position=(0.90, 0.90))
)
f = pw.load_ggplot(p1) | pw.load_ggplot(p2) | pw.load_ggplot(p3)
display(f)
As noted earlier, the probability XGBoost model seems to be the best performing of the three for probability prediction. Therefore, let's spend some time understanding how this model arrives at its results. For this, we can draw up a traditional variable importance plot. Note that the ordering of this plot does not exactly match that of the SHAP beeswarm plot (see below). This is because the variable importances are calculated as the average degradation in the AUC score when the values of a given feature are shuffled, whereas the beeswarm plot is ordered simply by the mean absolute SHAP value of each feature. The two thus produce different rankings, though they usually identify the same features as the most important.
The variable importance plot suggests that the most important predictor of fast growth is the age of the firm, closely followed by various sales-related features. In addition, personnel and material expenditures, as well as the firm's industry, also play an important role in the predictions. Together, the 5 most important variables have a joint importance of 53%, and the 10 most important 75%. If we go up to the 20 most important variables, their summed importance is 95%. Thus, if we were to put this model into production, we would only need about 20 features to produce mostly accurate results (and not all 122 we fed into the model).
numerical_columns = [col for col in tree_vars if col not in categories]
categorical_encoder = OneHotEncoder(handle_unknown="ignore")
preprocessing = ColumnTransformer(
[
("cat", categorical_encoder, categories),
("num", "passthrough", numerical_columns),
]
)
xgb_best_pipeline = Pipeline(
[
("preprocess", preprocessing),
("regressor", XGBClassifier(
tree_method='hist',
objective='binary:logistic',
eval_metric='rmse',
n_estimators=500,
random_state=42,
n_jobs=-1,
verbose=0,
colsample_bytree=prob_xgb.best_params_['colsample_bytree'],
gamma=prob_xgb.best_params_['gamma'],
learning_rate=prob_xgb.best_params_['learning_rate'],
max_depth=prob_xgb.best_params_['max_depth'],
min_child_weight=prob_xgb.best_params_['min_child_weight'],
subsample=prob_xgb.best_params_['subsample']
)),
]
)
xgb_best_pipeline.fit(data_train[tree_vars], data_train.b_fast_growth)
xgb_result = permutation_importance(
xgb_best_pipeline,
data_holdout[tree_vars],
data_holdout.b_fast_growth,
n_repeats=10,
scoring='roc_auc',
random_state=42,
n_jobs=-1,
)
xgb_grouped_imp = (
pd.DataFrame(xgb_result.importances_mean, data_holdout[tree_vars].columns)
.reset_index()
.rename({"index": "varname", 0: "imp"}, axis=1)
.assign(imp_percentage=lambda x: x["imp"] / x["imp"].sum())
.sort_values(by=["imp"], ascending=False)
)
varimp_plot = (
ggplot(
xgb_grouped_imp.loc[lambda x: x.imp_percentage > 0.01],
aes(x="reorder(varname, imp)", y="imp_percentage"),
)
+ geom_point(color=color[0], size=2.5)
+ geom_segment(
aes(x="varname", xend="varname", y=0, yend="imp_percentage"),
color=color[0],
size=2,
)
+ ylab("Importance (%)")
+ xlab("Variable name")
+ coord_flip()
+ ggtitle("Prob. XGBoost - Var. importance (above 1% cutoff)")
+ scale_y_continuous(labels=percent_format())
+ theme_bw()
)
display(varimp_plot)
print(f'Summed importance of top 5 most important predictors: {xgb_grouped_imp.imp_percentage.head(5).sum()*100:.2f}%')
print(f'Summed importance of top 10 most important predictors: {xgb_grouped_imp.imp_percentage.head(10).sum()*100:.2f}%')
print(f'Summed importance of top 20 most important predictors: {xgb_grouped_imp.imp_percentage.head(20).sum()*100:.2f}%')
Summed importance of top 5 most important predictors: 53.28%
Summed importance of top 10 most important predictors: 74.81%
Summed importance of top 20 most important predictors: 94.55%
The above presented variable importance plot has an important shortcoming: we cannot really know the direction of the relationship between the most important predictors and the target variable. To examine this, I also calculated SHAP values (as probabilities) on the hold-out set to uncover the importance and effects of different features on the model output. The below beeswarm plot tells us that the most influential feature is sales, which tends to have a broad range of SHAP values, suggesting that it plays a crucial role in the model’s decision-making. Interestingly, age and personnel expenditures also exhibit a strong impact on predictions, with both showing that lower feature values contribute to higher predicted probabilities. Notably, the log changes in sales (in the previous two years) follow closely behind in importance, indicating that recent changes in sales significantly affect the probability predictions.
Whether high feature values contribute positively or negatively to predicted probabilities is quite self-explanatory for firm age and personnel expenditure: according to the model, younger firms and those with lower personnel costs have a higher chance of being fast-growing in the next two years. However, it is interesting that high sales values generally contribute negatively to predicted probabilities. This might be a baseline effect: it is generally harder to achieve large percentage growth when the base value is already high. It also seems that past sales growth patterns are an important predictor of future sales growth; however, whether high or low values push predictions up is mixed, which may be indicative of some important interaction terms.
xgb_explainer = shap.TreeExplainer(prob_xgb.best_estimator_, data=X_hold_tree, model_output="probability")
xgb_shap_values = xgb_explainer(X_hold_tree)
shap.plots.beeswarm(xgb_shap_values, max_display = 20, color=plt.get_cmap("inferno_r"), show = False)
plt.title("Hold-out SHAP values for probability XGBoost")
plt.show()
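The beeswarm ordering mentioned above is simply a ranking by mean absolute SHAP value, which can be recovered directly from the raw SHAP matrix (the toy matrix and feature names below are hypothetical):

```python
import numpy as np

# toy SHAP matrix: rows are hold-out observations, columns are features
shap_matrix = np.array([
    [ 0.20, -0.01,  0.05],
    [-0.30,  0.02, -0.04],
    [ 0.25, -0.02,  0.06],
])
feature_names = ["sales", "age", "personnel_exp"]   # hypothetical names

mean_abs = np.abs(shap_matrix).mean(axis=0)         # the beeswarm's ordering statistic
ranking = [feature_names[i] for i in np.argsort(mean_abs)[::-1]]
print(ranking)   # → ['sales', 'personnel_exp', 'age']
```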
2.5. Classification - finding the optimal threshold¶
As I have already described in detail how I arrived at an appropriate loss function in the summary, I will now concentrate more on how the optimal thresholds were found. Finding the appropriate thresholds for each model consisted of the following steps:
- iterating through all of the models with the best cross-validated hyperparameters;
- defining the appropriate feature sets for each model;
- for each model, iterating through the 5 cross-validation folds
- within each fold, fitting a model on the train set and predicting probabilities on the test set;
- and then calculating the optimal threshold from the ROC curve values;
- and calculating the expected loss with the optimal threshold within the fold;
- after all folds, averaging out the optimal thresholds and expected loss values over the 5 folds.
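Condensed into a single hypothetical helper (the function name is mine, not the notebook's), the per-fold threshold search works like this:

```python
import numpy as np
from sklearn.metrics import roc_curve

def find_optimal_threshold(y_true, pred_prob, cost, prevalence):
    """Return the ROC threshold maximizing
    TPR + (1 - prevalence) / (cost * prevalence) * (1 - FPR),
    the cost-sensitive criterion used inside the cross-validation loop."""
    fpr, tpr, thresholds = roc_curve(y_true, pred_prob)
    score = tpr + (1 - prevalence) / (cost * prevalence) * (1 - fpr)
    return thresholds[np.argmax(score)]

# tiny smoke test on well-separated toy scores
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
p = y * 0.6 + rng.uniform(0, 0.4, size=100)   # positives score in [0.6, 1.0], negatives in [0, 0.4]
thr = find_optimal_threshold(y, p, cost=5, prevalence=0.5)
```

Up to tie-breaking, this is equivalent to the sorted-zip construction in the loop, just written as a vectorized `argmax`.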
FP = 1
FN = 5
cost = FN / FP
prevelance = float(np.sum(y_train_tree) / len(y_train_tree))
best_thresholds_cv = dict()
expected_loss_cv = dict()
fold5_threshold = dict()
fold5_expected_loss = dict()
fold5_all_coords = dict()
for model_name in ['lasso_logit', 'prob_forest', 'prob_xgb']:
best_thresholds = []
expected_loss = []
# setting appropriate RHS variables for different models
if model_name == 'lasso_logit':
X = X_train_lasso_scaled
else:
X = X_train_tree
fold = 0
# cross-validation
for train_index, test_index in k.split(X):
# train and test split for the fold
X_fold_test = X.iloc[test_index, :]
y_fold_test = data_train['b_fast_growth'].iloc[test_index]
X_fold_train = X.iloc[train_index, :]
y_fold_train = data_train['b_fast_growth'].iloc[train_index]
if model_name == 'lasso_logit':
estimator = LogisticRegression(
penalty='l1',
C=cv_summary_lasso['C value'].iloc[cv_summary_lasso['Mean CV AUC'].argmax()],
solver='saga',
random_state=42,
n_jobs=-1
)
elif model_name == 'prob_forest':
estimator = RandomForestClassifier(
random_state=42,
n_estimators=500,
oob_score=True,
n_jobs=-1,
criterion=prob_forest.best_params_['criterion'],
max_features=prob_forest.best_params_['max_features'],
min_samples_split=prob_forest.best_params_['min_samples_split']
)
else:
estimator = XGBClassifier(
tree_method='hist',
objective='binary:logistic',
eval_metric='rmse',
n_estimators=500,
random_state=42,
n_jobs=-1,
verbose=0,
colsample_bytree=prob_xgb.best_params_['colsample_bytree'],
gamma=prob_xgb.best_params_['gamma'],
learning_rate=prob_xgb.best_params_['learning_rate'],
max_depth=prob_xgb.best_params_['max_depth'],
min_child_weight=prob_xgb.best_params_['min_child_weight'],
subsample=prob_xgb.best_params_['subsample']
)
# fit model on fold
fitted_estimator = estimator.fit(X_fold_train, y_fold_train)
# make prediction on fold test data
pred_fold = fitted_estimator.predict_proba(X_fold_test)[:, 1]
#calculate ROC curve values to get optimal threshold and expected loss
false_pos_rate, true_pos_rate, thresholds = roc_curve(y_fold_test, pred_fold)
optimal_threshold = sorted(
list(
zip(
np.abs(true_pos_rate + (1 - prevelance) / (cost * prevelance) * (1 - false_pos_rate)),
thresholds,
)
),
key=lambda i: i[0],
reverse=True,
)[0][1]
best_thresholds.append(optimal_threshold)
threshold_prediction = np.where(pred_fold < optimal_threshold, 0, 1)
tn, fp, fn, tp = confusion_matrix(
y_fold_test, threshold_prediction, labels=[0, 1]
).ravel()
curr_exp_loss = (fp * FP + fn * FN) / len(y_fold_test)
expected_loss.append(curr_exp_loss)
fold = fold + 1
# CV mean values for threshold and loss
best_thresholds_cv[model_name] = np.mean(best_thresholds)
expected_loss_cv[model_name] = np.mean(expected_loss)
# 5th fold values
fold5_threshold[model_name] = optimal_threshold
fold5_expected_loss[model_name] = curr_exp_loss
all_coords = pd.DataFrame(
{
"false_pos": false_pos_rate * sum(y_fold_test == 0),
"true_pos": true_pos_rate * sum(y_fold_test == 1),
"false_neg": sum(y_fold_test == 1) - true_pos_rate * sum(y_fold_test == 1),
"true_neg": sum(y_fold_test == 0) - false_pos_rate * sum(y_fold_test == 0),
"pos": sum(y_fold_test == 1),
"neg": sum(y_fold_test == 0),
"n": len(y_fold_test),
"thresholds": thresholds,
}
)
fold5_all_coords[model_name] = all_coords
The below table presents the average of optimal thresholds and expected loss values over the 5 folds, as well as these values in the 5th fold only. The most important of these metrics is the cross-validated average expected loss. This tells us that the smallest expected loss is associated with the XGBoost model. Interestingly, the worst model in terms of expected loss seems to be the probability forest - even the LASSO logit model outperforms it. Note, however, that the differences between the models in cross-validated expected loss are rather tiny.
summary_with_lossfnc = pd.DataFrame(best_thresholds_cv.items(), columns=["Model_raw", "Avg. of optimal thresholds"])
summary_with_lossfnc["Threshold for 5th fold"] = fold5_threshold.values()
summary_with_lossfnc["Avg. expected loss"] = expected_loss_cv.values()
summary_with_lossfnc["Expected loss for 5th fold"] = fold5_expected_loss.values()
summary_with_lossfnc['Model'] = ['LASSO logit', 'Prob. forest', 'Prob. XGB']
summary_with_lossfnc.drop(columns=['Model_raw'], inplace=True)
summary_with_lossfnc = summary_with_lossfnc.filter(['Model'] + list(summary_with_lossfnc.columns)[:-1])
summary_with_lossfnc
| | Model | Avg. of optimal thresholds | Threshold for 5th fold | Avg. expected loss | Expected loss for 5th fold |
|---|---|---|---|---|---|
| 0 | LASSO logit | 0.186800 | 0.150095 | 0.448645 | 0.450246 |
| 1 | Prob. forest | 0.207618 | 0.204965 | 0.453308 | 0.454516 |
| 2 | Prob. XGB | 0.136336 | 0.134617 | 0.437547 | 0.422332 |
To give us an idea of the expected loss at different thresholds, we can take a look at the loss plots over the 5th fold. These generally reveal that the expected loss is highest when we pick a threshold far below the optimal one, and grows only gradually as the chosen threshold rises above the optimal one.
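The `create_loss_plot` helper is defined earlier in the source notebook; the expected-loss curve it draws boils down to one vectorized expression over the `all_coords` frame built above (the function name below is a hypothetical sketch, not the notebook's helper):

```python
import pandas as pd

FP_COST, FN_COST = 1, 5   # the report's loss-function weights

def loss_curve(all_coords: pd.DataFrame) -> pd.Series:
    # each row of all_coords is one ROC threshold; combine its error counts
    # into an expected loss per observation
    return (all_coords["false_pos"] * FP_COST + all_coords["false_neg"] * FN_COST) / all_coords["n"]

# toy frame with two thresholds: classify-all-negative vs. a mixed threshold
toy = pd.DataFrame({"false_pos": [0, 10], "false_neg": [50, 5], "n": [100, 100]})
print(loss_curve(toy).tolist())   # → [2.5, 0.35]
```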
pw.basefigure.clear()
p1 = (
create_loss_plot(
fold5_all_coords['lasso_logit'],
fold5_threshold['lasso_logit'],
fold5_expected_loss['lasso_logit'])
+ labs(title = 'LASSO logit')
+ scale_y_continuous(limits=(0.4, 0.9), breaks=seq(0.4, 0.9, 0.1))
)
p2 = (
create_loss_plot(
fold5_all_coords['prob_forest'],
fold5_threshold['prob_forest'],
fold5_expected_loss['prob_forest'])
+ labs(title = 'Probability forest')
+ scale_y_continuous(limits=(0.4, 0.9), breaks=seq(0.4, 0.9, 0.1))
)
p3 = (
create_loss_plot(
fold5_all_coords['prob_xgb'],
fold5_threshold['prob_xgb'],
fold5_expected_loss['prob_xgb'])
+ labs(title = 'Probability XGBoost')
+ scale_y_continuous(limits=(0.4, 0.9), breaks=seq(0.4, 0.9, 0.1))
)
f = pw.load_ggplot(p1) | pw.load_ggplot(p2) | pw.load_ggplot(p3)
display(f)
We can also check out the ROC curves of the 5th fold, marking the optimal thresholds. These plots tell us that for the best model, that is the XGBoost, the optimal threshold over the 5th fold corresponds to a false positive rate of 17.9% and a true positive rate of 57.3%. Even though the LASSO logit model has a slightly higher true positive rate (58.8%), this comes at a cost of an increased false positive rate (22.2%). The lowest false positive rate is associated with the probability forest (15.7%), but as a result, its true positive rate is also the lowest (50.4%). This illustrates well the general trade-off between the false positive rate and the true positive rate.
pw.basefigure.clear()
p1 = create_roc_plot_with_optimal(fold5_all_coords['lasso_logit'], fold5_threshold['lasso_logit']) + labs(title='LASSO logit')
p2 = create_roc_plot_with_optimal(fold5_all_coords['prob_forest'], fold5_threshold['prob_forest']) + labs(title='Probability forest')
p3 = create_roc_plot_with_optimal(fold5_all_coords['prob_xgb'], fold5_threshold['prob_xgb']) + labs(title='Probability XGBoost')
f = pw.load_ggplot(p1) | pw.load_ggplot(p2) | pw.load_ggplot(p3)
display(f)
2.6. Classification performance on hold-out set¶
Hold-out set expected loss values indicate that probability XGBoost performed best with an expected loss of 0.4370, followed by LASSO logit at 0.4651, and the probability forest at 0.4669. The confusion matrices show that all models struggle with correctly identifying fast-growing cases, as they tend to misclassify a significant portion as non-fast growth. However, probability XGBoost correctly classified 266 fast-growing cases, the highest among the models.
LASSO logit classified 231 fast-growth cases correctly but misclassified 242, while probability forest performed the worst in this regard, correctly identifying only 222 cases and misclassifying 251. In contrast, probability forest had the lowest number of false positives (523), suggesting it was more conservative in predicting fast growth compared to the other models.
Overall, probability XGBoost exhibited the lowest expected loss. This model achieves an overall accuracy of approximately 78%, indicating that it correctly classifies a substantial majority of observations. The sensitivity, which measures the proportion of actual fast-growing firms that are correctly identified, is around 56%, suggesting that the model detects more than half of the fast-growing cases. Meanwhile, the specificity, which reflects the proportion of non-fast-growing firms correctly classified, stands at more than 81%, meaning the model effectively filters out non-fast-growing cases.
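The accuracy, sensitivity, and specificity quoted above follow directly from the XGBoost hold-out confusion matrix reported further down (tn = 2706, fp = 629, fn = 207, tp = 266):

```python
# counts taken from the XGBoost hold-out confusion matrix in the report
tn, fp, fn, tp = 2706, 629, 207, 266

accuracy = (tp + tn) / (tn + fp + fn + tp)   # 2972 / 3808
sensitivity = tp / (tp + fn)                 # 266 / 473
specificity = tn / (tn + fp)                 # 2706 / 3335
print(f"{accuracy:.2%} {sensitivity:.2%} {specificity:.2%}")   # → 78.05% 56.24% 81.14%
```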
pred_dc = {
'lasso_logit' : {
'full_name' : 'LASSO logit',
'pred_col' : 'lasso_pred',
'threshold' : best_thresholds_cv['lasso_logit']
},
'prob_forest' : {
'full_name' : 'Prob. forest',
'pred_col' : 'rf_pred',
'threshold' : best_thresholds_cv['prob_forest']
},
'prob_xgb' : {
'full_name' : 'Prob. XGB',
'pred_col' : 'xgb_pred',
'threshold' : best_thresholds_cv['prob_xgb']
}
}
hold_out_loss = dict()
for name, dc in pred_dc.items():
    print(f'Hold-out set classification results for: {dc["full_name"]}')
    # classify with the cross-validated optimal threshold
    holdout_threshold = np.where(hold_out_predictions[dc['pred_col']] < dc['threshold'], 0, 1)
    tn, fp, fn, tp = confusion_matrix(hold_out_predictions['b_fast_growth_true'], holdout_threshold, labels=[0, 1]).ravel()
    expected_loss_holdout = (fp * FP + fn * FN) / len(hold_out_predictions['b_fast_growth_true'])
    print(f'Expected loss on hold-out set: {expected_loss_holdout:.4f}')
    hold_out_loss[dc['full_name']] = expected_loss_holdout
    cm = confusion_matrix(hold_out_predictions['b_fast_growth_true'], holdout_threshold, labels=[0, 1])
    cm = pd.DataFrame(
        cm,
        index=["Actual no fast growth", "Actual fast growth"],
        columns=["Predicted no fast growth", "Predicted fast growth"],
    )
    print('Confusion matrix for the hold-out set:')
    display(cm)
Hold-out set classification results for: LASSO logit
Expected loss on hold-out set: 0.4651
Confusion matrix for the hold-out set:
| | Predicted no fast growth | Predicted fast growth |
|---|---|---|
| Actual no fast growth | 2774 | 561 |
| Actual fast growth | 242 | 231 |
Hold-out set classification results for: Prob. forest
Expected loss on hold-out set: 0.4669
Confusion matrix for the hold-out set:
| | Predicted no fast growth | Predicted fast growth |
|---|---|---|
| Actual no fast growth | 2812 | 523 |
| Actual fast growth | 251 | 222 |
Hold-out set classification results for: Prob. XGB
Expected loss on hold-out set: 0.4370
Confusion matrix for the hold-out set:
| | Predicted no fast growth | Predicted fast growth |
|---|---|---|
| Actual no fast growth | 2706 | 629 |
| Actual fast growth | 207 | 266 |
Before moving on to the subsample classification task, let's have a final look at all the metrics calculated during the analysis. From these, it seems that the probability XGBoost model consistently outperformed the rest of the models across all metrics, though its edge is rather small. There is, however, one caveat to using the XGBoost model: its training time far exceeds that of the other models, so it may not be the most suitable option in resource-scarce environments.
full_summary = summary_with_lossfnc.merge(
pd.DataFrame(hold_out_loss.items(), columns=['Model', 'Hold-out loss']),
on='Model'
).merge(
prob_summary,
left_on='Model',
right_index=True
)
full_summary = full_summary.filter(
['Model', 'Time (min)', 'CV RMSE', 'Hold-out RMSE', 'CV AUC', 'Hold-out AUC', 'Avg. of optimal thresholds', 'Avg. expected loss', 'Hold-out loss']
)
full_summary.rename(
columns={
'Avg. of optimal thresholds' : 'CV class. threshold',
'Avg. expected loss' : 'CV expected loss',
}, inplace=True
)
full_summary
| | Model | Time (min) | CV RMSE | Hold-out RMSE | CV AUC | Hold-out AUC | CV class. threshold | CV expected loss | Hold-out loss |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LASSO logit | 2.42 | 0.3121 | 0.3118 | 0.7483 | 0.7444 | 0.186800 | 0.448645 | 0.465074 |
| 1 | Prob. forest | 17.93 | 0.3111 | 0.3097 | 0.7468 | 0.7456 | 0.207618 | 0.453308 | 0.466912 |
| 2 | Prob. XGB | 24.50 | 0.3094 | 0.3079 | 0.7647 | 0.7555 | 0.136336 | 0.437547 | 0.436975 |
2.7. Classification on subsamples - finding the optimal threshold¶
To perform the classification task on the different sectors separately, I had to train two separate models on the two subsamples of the data. For this, I used the previously best-performing probability XGBoost set-up with the cross-validated hyperparameters.
Based on a quick Google search of the NACE system, I could identify that the 2-digit codes of 33 (repairs), 55 (hospitality) and 56 (restaurants) belong to the services industry, while the rest of the codes belong to manufacturing. Note that with this split, the service sector subsample is roughly 4 times as large as the manufacturing one. Also, the train-holdout split is now not as balanced as it was for the full sample. Nonetheless, we should keep this split so that the subsample results remain comparable with the full-sample ones (as the models are trained and validated on the same parts of the dataset).
data_train_serv = data_train[data_train['f_ind2_cat'].isin([33,55,56])]
data_train_manu = data_train[~data_train['f_ind2_cat'].isin([33,55,56])]
data_holdout_serv = data_holdout[data_holdout['f_ind2_cat'].isin([33,55,56])]
data_holdout_manu = data_holdout[~data_holdout['f_ind2_cat'].isin([33,55,56])]
print(f'Service sector:')
print(f' Number of observations in the training set: {len(data_train_serv)} ({(data_train_serv.b_fast_growth.sum() / data_train_serv.shape[0])*100:.2f}% fast growth)')
print(f' Number of observations in the hold-out set: {len(data_holdout_serv)} ({(data_holdout_serv.b_fast_growth.sum() / data_holdout_serv.shape[0])*100:.2f}% fast growth)')
print(f'\nManufacturing sector:')
print(f' Number of observations in the training set: {len(data_train_manu)} ({(data_train_manu.b_fast_growth.sum() / data_train_manu.shape[0])*100:.2f}% fast growth)')
print(f' Number of observations in the hold-out set: {len(data_holdout_manu)} ({(data_holdout_manu.b_fast_growth.sum() / data_holdout_manu.shape[0])*100:.2f}% fast growth)')
Service sector:
Number of observations in the training set: 12214 (12.56% fast growth)
Number of observations in the hold-out set: 3032 (11.77% fast growth)
Manufacturing sector:
Number of observations in the training set: 3014 (12.24% fast growth)
Number of observations in the hold-out set: 776 (14.95% fast growth)
After fitting the models on their respective training sets, I predicted probabilities for the respective hold-out sets to evaluate the models' probability prediction performance. It turns out that the model is far more accurate on the services subsample, with an RMSE of 0.299 and an AUC of 0.770, compared to the manufacturing RMSE of 0.343 and AUC of 0.701.
for d in [data_train_serv, data_train_manu, data_holdout_serv, data_holdout_manu]:
for col in ['f_ind', 'f_ind2_cat']:
d[col] = d[col].cat.remove_unused_categories()
best_xgb_serv = XGBClassifier(
tree_method='hist',
objective='binary:logistic',
eval_metric='rmse',
n_estimators=500,
random_state=42,
n_jobs=-1,
verbose=0,
colsample_bytree=prob_xgb.best_params_['colsample_bytree'],
gamma=prob_xgb.best_params_['gamma'],
learning_rate=prob_xgb.best_params_['learning_rate'],
max_depth=prob_xgb.best_params_['max_depth'],
min_child_weight=prob_xgb.best_params_['min_child_weight'],
subsample=prob_xgb.best_params_['subsample']
)
best_xgb_manu = XGBClassifier(
tree_method='hist',
objective='binary:logistic',
eval_metric='rmse',
n_estimators=500,
random_state=42,
n_jobs=-1,
verbose=0,
colsample_bytree=prob_xgb.best_params_['colsample_bytree'],
gamma=prob_xgb.best_params_['gamma'],
learning_rate=prob_xgb.best_params_['learning_rate'],
max_depth=prob_xgb.best_params_['max_depth'],
min_child_weight=prob_xgb.best_params_['min_child_weight'],
subsample=prob_xgb.best_params_['subsample']
)
def prepare_data(tree_vars, data):
    # one-hot encode the features and sanitize column names
    # ('[', ']' and '<' are not allowed in XGBoost feature names)
    y = pd.DataFrame(data['b_fast_growth'])
    X = pd.get_dummies(data[tree_vars], drop_first=False)
    X.columns = [col.replace("[", "(").replace("]", ")").replace("<", "") for col in X.columns]
    for c in [c for c in X.columns if c.startswith('f_')]:
        X[c] = X[c].astype(int)
    return y, X
y_train_serv, X_train_serv = prepare_data(tree_vars, data_train_serv)
y_train_manu, X_train_manu = prepare_data(tree_vars, data_train_manu)
y_holdout_serv, X_holdout_serv = prepare_data(tree_vars, data_holdout_serv)
y_holdout_manu, X_holdout_manu = prepare_data(tree_vars, data_holdout_manu)
serv_fitted = best_xgb_serv.fit(X_train_serv, y_train_serv)
manu_fitted = best_xgb_manu.fit(X_train_manu, y_train_manu)
serv_pred = pd.DataFrame({
'b_fast_growth_true' : y_holdout_serv.values.ravel(),
'xgb_pred' : serv_fitted.predict_proba(X_holdout_serv)[:, 1]
})
manu_pred = pd.DataFrame({
'b_fast_growth_true' : y_holdout_manu.values.ravel(),
'xgb_pred' : manu_fitted.predict_proba(X_holdout_manu)[:, 1]
})
print('Service sector hold-out set results:')
print('RMSE:', mean_squared_error(serv_pred['b_fast_growth_true'], serv_pred['xgb_pred'], squared=False).round(4))
print('AUC:', roc_auc_score(serv_pred['b_fast_growth_true'], serv_pred['xgb_pred']).round(4), '\n')
print('Manufacturing sector hold-out set results:')
print('RMSE:', mean_squared_error(manu_pred['b_fast_growth_true'], manu_pred['xgb_pred'], squared=False).round(4))
print('AUC:', roc_auc_score(manu_pred['b_fast_growth_true'], manu_pred['xgb_pred']).round(4))
Service sector hold-out set results:
RMSE: 0.2988
AUC: 0.7698

Manufacturing sector hold-out set results:
RMSE: 0.3427
AUC: 0.7006
The difference in performance is also apparent from the hold-out set ROC-curves per sector: the curve of the manufacturing sector lies closer to the diagonal, indicating worse predictive performance.
pw.basefigure.clear()
p1 = create_roc_plot(serv_pred['b_fast_growth_true'], serv_pred['xgb_pred']) + labs(title='Service sector')
p2 = create_roc_plot(manu_pred['b_fast_growth_true'], manu_pred['xgb_pred']) + labs(title='Manufacturing sector')
f = pw.load_ggplot(p1) | pw.load_ggplot(p2)
display(f)
We can also look at the sectoral calibration curves on the hold-out set. It turns out that not only is the overall performance better in the service sector, but the model is also better calibrated for that subsample. In contrast, the manufacturing sector model has a tendency to underpredict relatively high probabilities - but remember that the manufacturing training set contained a share of fast-growth firms almost 3 percentage points lower than the manufacturing hold-out set, which may be a natural source of this miscalibration on the hold-out set.
pw.basefigure.clear()
p1 = create_calibration_plot(
serv_pred,
prob_var="xgb_pred",
actual_var="b_fast_growth_true",
n_bins=10,
breaks=None,
) + labs(title="Service sector")
p2 = create_calibration_plot(
manu_pred,
prob_var="xgb_pred",
actual_var="b_fast_growth_true",
n_bins=10,
breaks=None,
) + labs(title="Manufacturing sector")
f = pw.load_ggplot(p1) | pw.load_ggplot(p2)
display(f)
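The `create_calibration_plot` helper comes from earlier in the source notebook; its core is just binning predictions and comparing each bin's mean prediction to the observed positive rate. A minimal sketch of that binning logic (the function and column names are mine, not the notebook's):

```python
import numpy as np
import pandas as pd

def calibration_table(df, prob_var, actual_var, n_bins=10):
    # cut predicted probabilities into equal-width bins over [0, 1]
    bins = pd.cut(df[prob_var], bins=np.linspace(0, 1, n_bins + 1), include_lowest=True)
    return (
        df.groupby(bins, observed=False)
        .agg(mean_pred=(prob_var, "mean"),      # average prediction within the bin
             actual_rate=(actual_var, "mean"),  # observed share of positives
             n=(actual_var, "size"))
        .reset_index(drop=True)
    )

# perfectly calibrated toy predictions land on the 45-degree line
toy = pd.DataFrame({"pred": [0.05, 0.05, 0.95, 0.95], "y": [0, 0, 1, 1]})
tab = calibration_table(toy, "pred", "y")
```

Plotting `actual_rate` against `mean_pred` and comparing the points to the diagonal gives the calibration curve.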
Having examined the probability prediction performance metrics, let's turn our attention towards classification. Similarly to earlier, I cross-validate the optimal classification threshold for each sector using 5-fold cross-validation. One minor change I had to make was to calculate the prevalence metric for each sector separately.
sect_best_thresholds_cv = dict()
sect_expected_loss_cv = dict()
sect_fold5_threshold = dict()
sect_fold5_expected_loss = dict()
sect_fold5_all_coords = dict()
for model_name in ['service', 'manufacturing']:
best_thresholds = []
expected_loss = []
# setting appropriate RHS variables for different models
if model_name == 'service':
X = X_train_serv
else:
X = X_train_manu
#calculate prevelance for the sector
if model_name == 'service':
sect_prevelance = float(np.sum(y_train_serv) / len(y_train_serv))
else:
sect_prevelance = float(np.sum(y_train_manu) / len(y_train_manu))
fold = 0
# cross-validation
for train_index, test_index in k.split(X):
# train and test split for the fold
X_fold_test = X.iloc[test_index, :]
X_fold_train = X.iloc[train_index, :]
if model_name == 'service':
y_fold_test = data_train_serv['b_fast_growth'].iloc[test_index]
y_fold_train = data_train_serv['b_fast_growth'].iloc[train_index]
else:
y_fold_test = data_train_manu['b_fast_growth'].iloc[test_index]
y_fold_train = data_train_manu['b_fast_growth'].iloc[train_index]
estimator = XGBClassifier(
tree_method='hist',
objective='binary:logistic',
eval_metric='rmse',
n_estimators=500,
random_state=42,
n_jobs=-1,
verbose=0,
colsample_bytree=prob_xgb.best_params_['colsample_bytree'],
gamma=prob_xgb.best_params_['gamma'],
learning_rate=prob_xgb.best_params_['learning_rate'],
max_depth=prob_xgb.best_params_['max_depth'],
min_child_weight=prob_xgb.best_params_['min_child_weight'],
subsample=prob_xgb.best_params_['subsample']
)
# fit model on fold
fitted_estimator = estimator.fit(X_fold_train, y_fold_train)
# make prediction on fold test data
pred_fold = fitted_estimator.predict_proba(X_fold_test)[:, 1]
#calculate ROC curve values to get optimal threshold and expected loss
false_pos_rate, true_pos_rate, thresholds = roc_curve(y_fold_test, pred_fold)
optimal_threshold = sorted(
list(
zip(
np.abs(true_pos_rate + (1 - sect_prevelance) / (cost * sect_prevelance) * (1 - false_pos_rate)),
thresholds,
)
),
key=lambda i: i[0],
reverse=True,
)[0][1]
best_thresholds.append(optimal_threshold)
threshold_prediction = np.where(pred_fold < optimal_threshold, 0, 1)
tn, fp, fn, tp = confusion_matrix(
y_fold_test, threshold_prediction, labels=[0, 1]
).ravel()
curr_exp_loss = (fp * FP + fn * FN) / len(y_fold_test)
expected_loss.append(curr_exp_loss)
fold = fold + 1
# CV mean values for threshold and loss
sect_best_thresholds_cv[model_name] = np.mean(best_thresholds)
sect_expected_loss_cv[model_name] = np.mean(expected_loss)
# 5th fold values
sect_fold5_threshold[model_name] = optimal_threshold
sect_fold5_expected_loss[model_name] = curr_exp_loss
sect_all_coords = pd.DataFrame(
{
"false_pos": false_pos_rate * sum(y_fold_test == 0),
"true_pos": true_pos_rate * sum(y_fold_test == 1),
"false_neg": sum(y_fold_test == 1) - true_pos_rate * sum(y_fold_test == 1),
"true_neg": sum(y_fold_test == 0) - false_pos_rate * sum(y_fold_test == 0),
"pos": sum(y_fold_test == 1),
"neg": sum(y_fold_test == 0),
"n": len(y_fold_test),
"thresholds": thresholds,
}
)
sect_fold5_all_coords[model_name] = sect_all_coords
As expected, the cross-validated expected loss is also substantially smaller for the service sector model. This indicates that we can achieve a better classification performance in the service sector than in manufacturing.
sect_summary_with_lossfnc = pd.DataFrame(
sect_best_thresholds_cv.items(), columns=["Sector_raw", "Avg. of optimal thresholds"]
)
sect_summary_with_lossfnc["Threshold for 5th fold"] = sect_fold5_threshold.values()
sect_summary_with_lossfnc["Avg. expected loss"] = sect_expected_loss_cv.values()
sect_summary_with_lossfnc["Expected loss for 5th fold"] = sect_fold5_expected_loss.values()
sect_summary_with_lossfnc['Sector'] = ['Service', 'Manufacturing']
sect_summary_with_lossfnc.drop(columns=['Sector_raw'], inplace=True)
sect_summary_with_lossfnc = sect_summary_with_lossfnc.filter(['Sector'] + list(sect_summary_with_lossfnc.columns)[:-1])
sect_summary_with_lossfnc
| | Sector | Avg. of optimal thresholds | Threshold for 5th fold | Avg. expected loss | Expected loss for 5th fold |
|---|---|---|---|---|---|
| 0 | Service | 0.129757 | 0.123595 | 0.428688 | 0.425471 |
| 1 | Manufacturing | 0.132253 | 0.154358 | 0.469802 | 0.451827 |
Just as an illustration, I also visualized the loss-plots for the 5th fold of the cross-validation.
pw.basefigure.clear()
p1 = (
create_loss_plot(
sect_fold5_all_coords['service'],
sect_fold5_threshold['service'],
sect_fold5_expected_loss['service'])
+ labs(title = 'Service sector')
+ scale_y_continuous(limits=(0.4, 0.9), breaks=seq(0.4, 0.9, 0.1))
)
p2 = (
create_loss_plot(
sect_fold5_all_coords['manufacturing'],
sect_fold5_threshold['manufacturing'],
sect_fold5_expected_loss['manufacturing'])
+ labs(title = 'Manufacturing sector')
+ scale_y_continuous(limits=(0.4, 0.9), breaks=seq(0.4, 0.9, 0.1))
)
f = pw.load_ggplot(p1) | pw.load_ggplot(p2)
display(f)
I also plotted the ROC curves, marking the optimal threshold for the 5th fold. These suggest that the service sector model has a substantially higher true positive rate (62.2%, compared to 52.0% for the manufacturing sector model), which comes at the cost of only a slightly increased false positive rate (21.7% versus 17.5%).
pw.basefigure.clear()
p1 = create_roc_plot_with_optimal(sect_fold5_all_coords['service'], sect_fold5_threshold['service']) + labs(title = 'Service sector')
p2 = create_roc_plot_with_optimal(sect_fold5_all_coords['manufacturing'], sect_fold5_threshold['manufacturing']) + labs(title = 'Manufacturing sector')
f = pw.load_ggplot(p1) | pw.load_ggplot(p2)
display(f)
2.8. Classification performance on the hold-out sets¶
The classification results for the service and manufacturing models on the hold-out set indicate a notable difference in performance. The service model achieved a lower expected loss of 0.413, suggesting better overall predictive accuracy compared to the manufacturing model, which had a significantly higher expected loss of 0.549. The service model correctly classified 208 fast-growth cases while misclassifying 149, whereas the manufacturing model identified only 53 fast-growth cases correctly but misclassified 63. Note that the smaller dataset size for manufacturing has most probably contributed to its higher expected loss and reduced classification performance.
sect_pred_dc = {
'service' : {
'full_name' : 'Service',
'pred_df' : serv_pred,
'pred_col' : 'xgb_pred',
'threshold' : sect_best_thresholds_cv['service']
},
'manufacturing' : {
'full_name' : 'Manufacturing',
'pred_df' : manu_pred,
'pred_col' : 'xgb_pred',
'threshold' : sect_best_thresholds_cv['manufacturing']
}
}
sect_hold_out_loss = dict()
for name, dc in sect_pred_dc.items():
    print(f'Hold-out set classification results for: {dc["full_name"]}')
    # classify with the sector's cross-validated optimal threshold
    holdout_threshold = np.where(dc['pred_df'][dc['pred_col']] < dc['threshold'], 0, 1)
    tn, fp, fn, tp = confusion_matrix(dc['pred_df']['b_fast_growth_true'], holdout_threshold, labels=[0, 1]).ravel()
    expected_loss_holdout = (fp * FP + fn * FN) / len(dc['pred_df']['b_fast_growth_true'])
    print(f'Expected loss on hold-out set: {expected_loss_holdout:.4f}')
    sect_hold_out_loss[dc['full_name']] = expected_loss_holdout
    cm = confusion_matrix(dc['pred_df']['b_fast_growth_true'], holdout_threshold, labels=[0, 1])
    cm = pd.DataFrame(
        cm,
        index=["Actual no fast growth", "Actual fast growth"],
        columns=["Predicted no fast growth", "Predicted fast growth"],
    )
    print('Confusion matrix for the hold-out set:')
    display(cm)
Hold-out set classification results for: Service
Expected loss on hold-out set: 0.4129
Confusion matrix for the hold-out set:
| | Predicted no fast growth | Predicted fast growth |
|---|---|---|
| Actual no fast growth | 2168 | 507 |
| Actual fast growth | 149 | 208 |
Hold-out set classification results for: Manufacturing
Expected loss on hold-out set: 0.5490
Confusion matrix for the hold-out set:
| | Predicted no fast growth | Predicted fast growth |
|---|---|---|
| Actual no fast growth | 549 | 111 |
| Actual fast growth | 63 | 53 |
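As a sanity check, both hold-out expected-loss figures can be recomputed by hand from the confusion matrices above:

```python
FP_COST, FN_COST = 1, 5   # the report's loss-function weights

# service: fp = 507, fn = 149, n = 2168 + 507 + 149 + 208 = 3032
serv_loss = (507 * FP_COST + 149 * FN_COST) / 3032
# manufacturing: fp = 111, fn = 63, n = 549 + 111 + 63 + 53 = 776
manu_loss = (111 * FP_COST + 63 * FN_COST) / 776
print(round(serv_loss, 4), round(manu_loss, 4))   # → 0.4129 0.549
```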
Last but not least, let's compare the hold-out set expected loss results with the cross-validated expected losses. Interestingly, the hold-out set expected loss for the service sector model is actually marginally smaller than the cross-validated one, so we definitely do not have an overfitting issue there. For the manufacturing model, however, the hold-out set expected loss is substantially higher than the cross-validated one. This might mean that the model is considerably overfitted to the training data, which should come as no surprise given that the manufacturing subsample was much smaller.
sect_full_summary = sect_summary_with_lossfnc.merge(
pd.DataFrame(sect_hold_out_loss.items(), columns=['Sector', 'Hold-out loss']),
on='Sector'
)
sect_full_summary
| | Sector | Avg. of optimal thresholds | Threshold for 5th fold | Avg. expected loss | Expected loss for 5th fold | Hold-out loss |
|---|---|---|---|---|---|---|
| 0 | Service | 0.129757 | 0.123595 | 0.428688 | 0.425471 | 0.412929 |
| 1 | Manufacturing | 0.132253 | 0.154358 | 0.469802 | 0.451827 | 0.548969 |